Subsetting list by values in a column in R

I want to subset a list of dataframes so that it returns the list in the same structure, but excluding rows from each dataframe that meet a condition in one column.
Say I have the following list:
col1<- round(rnorm(5, mean = 5), digits = 0)
col2<- round(rnorm(5, mean = 5), digits = 0)
col3<- round(rnorm(5, mean = 5), digits = 0)
a <- data.frame(col1, col2, col3)
col1<- round(rnorm(5, mean = 5), digits = 0)
col2<- round(rnorm(5, mean = 5), digits = 0)
col3<- round(rnorm(5, mean = 5), digits = 0)
b <- data.frame(col1, col2, col3)
col1<- round(rnorm(5, mean = 5), digits = 0)
col2<- round(rnorm(5, mean = 5), digits = 0)
col3<- round(rnorm(5, mean = 5), digits = 0)
c <- data.frame(col1, col2, col3)
my_list <- list(a,b,c)
names(my_list)<-c("df1", "df2", "df3")
This provides a list:
> my_list
$df1
col1 col2 col3
1 3 6 5
2 5 4 4
3 6 5 6
4 5 3 6
5 4 4 4
$df2
col1 col2 col3
1 6 5 5
2 6 5 5
3 5 6 6
4 5 4 5
5 6 5 5
$df3
col1 col2 col3
1 6 7 5
2 6 5 5
3 5 6 4
4 4 6 5
5 5 6 4
Say I want to remove all rows whose value in col3 is less than 5, producing:
> my_list
$df1
col1 col2 col3
1 3 6 5
3 6 5 6
4 5 3 6
$df2
col1 col2 col3
1 6 5 5
2 6 5 5
3 5 6 6
4 5 4 5
5 6 5 5
$df3
col1 col2 col3
1 6 7 5
2 6 5 5
4 4 6 5
I have tried using lapply to no avail:
result <- lapply(my_list, function(x) {
return(x[x$'col3' < 5])
}
)
> result
$df1
[1] FALSE TRUE FALSE FALSE TRUE
$df2
[1] FALSE FALSE FALSE FALSE FALSE
$df3
[1] FALSE FALSE TRUE FALSE TRUE
Any help would be greatly appreciated!

Base R:
set.seed(1)
l <- lapply(my_list, function(x) subset(x, col3 >= 5))
l
#> $df1
#> col1 col2 col3
#> 1 5 5 5
#> 2 5 5 5
#> 3 4 4 5
#>
#> $df2
#> col1 col2 col3
#> 1 6 5 7
#> 2 3 6 5
#> 4 5 5 5
#>
#> $df3
#> col1 col2 col3
#> 4 4 5 7
#> 5 7 4 7
do.call(rbind, l)
#> col1 col2 col3
#> df1.1 5 5 5
#> df1.2 5 5 5
#> df1.3 4 4 5
#> df2.1 6 5 7
#> df2.2 3 6 5
#> df2.4 5 5 5
#> df3.4 4 5 7
#> df3.5 7 4 7
Created on 2021-02-05 by the reprex package (v1.0.0)
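The lapply attempt from the question can also be repaired with plain bracket indexing. The sketch below hard-codes the values printed in the question (an assumption made for reproducibility, since the original data came from rnorm without a seed). Two things change relative to the attempt: the condition keeps rows with col3 >= 5 instead of selecting those below 5, and a comma after the condition makes it a row index rather than a column index.

```r
# Data copied from the printed output in the question (assumed, because the
# original values were random).
my_list <- list(
  df1 = data.frame(col1 = c(3, 5, 6, 5, 4), col2 = c(6, 4, 5, 3, 4), col3 = c(5, 4, 6, 6, 4)),
  df2 = data.frame(col1 = c(6, 6, 5, 5, 6), col2 = c(5, 5, 6, 4, 5), col3 = c(5, 5, 6, 5, 5)),
  df3 = data.frame(col1 = c(6, 6, 5, 4, 5), col2 = c(7, 5, 6, 6, 6), col3 = c(5, 5, 4, 5, 4))
)

# Keep rows where col3 is at least 5. The comma after the condition indexes
# rows; drop = FALSE stops the result from collapsing to a vector if a data
# frame ever has only one column.
result <- lapply(my_list, function(x) x[x$col3 >= 5, , drop = FALSE])
result
```

Unlike subset(), bracket indexing returns NA rows when col3 contains missing values, so wrap the condition in which() if that can happen in your data.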

Here's a tidyverse solution:
library(tidyverse)
result <- function(x) {
x %>%
filter(col3 < 6)
}
map(my_list, result)
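This returns a list of data.frames containing only the rows where col3 is less than 6. To keep the rows the question asks for, change the condition to col3 >= 5.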
$df1
col1 col2 col3
1 5 4 4
2 4 4 4
$df2
col1 col2 col3
1 6 7 5
$df3
col1 col2 col3
1 6 5 5
2 5 5 5
3 5 5 3
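You can combine the results into a single data.frame by using map_df (superseded since purrr 1.0.0 in favour of map() followed by list_rbind(), but still available):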
map_df(my_list, result)
This gives us:
> map_df(my_list, result)
col1 col2 col3
1 5 4 4
2 4 4 4
3 6 7 5
4 6 5 5
5 5 5 5
6 5 5 3
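When the filtered pieces are stacked, it is often useful to record which data frame each row came from. A minimal sketch, assuming the purrr and dplyr packages are installed and reusing the values printed in the question; the column name source is a hypothetical label chosen for illustration:

```r
library(purrr)
library(dplyr)

# Data copied from the printed output in the question (assumed values).
my_list <- list(
  df1 = data.frame(col1 = c(3, 5, 6, 5, 4), col2 = c(6, 4, 5, 3, 4), col3 = c(5, 4, 6, 6, 4)),
  df2 = data.frame(col1 = c(6, 6, 5, 5, 6), col2 = c(5, 5, 6, 4, 5), col3 = c(5, 5, 6, 5, 5)),
  df3 = data.frame(col1 = c(6, 6, 5, 4, 5), col2 = c(7, 5, 6, 6, 6), col3 = c(5, 5, 4, 5, 4))
)

# Filter every data frame; map() keeps the list structure and its names.
filtered <- map(my_list, function(x) filter(x, col3 >= 5))

# Stack into one data frame; .id stores the list names in a new column.
combined <- bind_rows(filtered, .id = "source")
combined
```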

Related

Dataframe from a vector and a list of vectors by replicating elements

I have a vector and list of the same length. The list contains vectors of arbitrary lengths as such:
vec1 <- c("a", "b", "c")
list1 <- list(c(1, 3, 2),
c(4, 5, 8, 9),
c(5, 2))
What is the fastest, most effective way to create a dataframe such that the elements of vec1 are replicated the number of times corresponding to their index in list1?
Expected output:
# col1 col2
# 1 a 1
# 2 a 3
# 3 a 2
# 4 b 4
# 5 b 5
# 6 b 8
# 7 b 9
# 8 c 5
# 9 c 2
I have included a tidy solution as an answer, but I was wondering if there are other ways to approach this task.
In base R, set the names of the list to vec1 and use stack to return a two-column data.frame:
stack(setNames(list1, vec1))[2:1]
Output:
ind values
1 a 1
2 a 3
3 a 2
4 b 4
5 b 5
6 b 8
7 b 9
8 c 5
9 c 2
If we want a tidyverse approach, use enframe:
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)  # set_names() comes from purrr, not from the packages above
list1 %>%
set_names(vec1) %>%
enframe(name = 'col1', value = 'col2') %>%
unnest(col2)
# A tibble: 9 × 2
col1 col2
<chr> <dbl>
1 a 1
2 a 3
3 a 2
4 b 4
5 b 5
6 b 8
7 b 9
8 c 5
9 c 2
This tidy solution replicates the vec1 elements according to the nested vector's lengths, then flattens both lists into a tibble.
library(purrr)
library(tibble)
tibble(col1 = flatten_chr(map2(vec1, map_int(list1, length), function(x, y) rep(x, times = y))),
col2 = flatten_dbl(list1))
# # A tibble: 9 × 2
# col1 col2
# <chr> <dbl>
# 1 a 1
# 2 a 3
# 3 a 2
# 4 b 4
# 5 b 5
# 6 b 8
# 7 b 9
# 8 c 5
# 9 c 2
A tidyr/tibble-approach could also be unnest_longer:
library(dplyr)
library(tidyr)
tibble(vec1, list1) |>
unnest_longer(list1)
Output:
# A tibble: 9 × 2
vec1 list1
<chr> <dbl>
1 a 1
2 a 3
3 a 2
4 b 4
5 b 5
6 b 8
7 b 9
8 c 5
9 c 2
Another possible solution, based on purrr::map2_dfr:
library(purrr)
map2_dfr(vec1, list1, ~ data.frame(col1 = .x, col2 =.y))
#> col1 col2
#> 1 a 1
#> 2 a 3
#> 3 a 2
#> 4 b 4
#> 5 b 5
#> 6 b 8
#> 7 b 9
#> 8 c 5
#> 9 c 2
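Since the question asks for other approaches, here is one more base R sketch that needs no packages. It relies on rep() accepting a vector of repeat counts and on lengths() returning the length of each list element. The column names col1 and col2 follow the expected output in the question.

```r
vec1 <- c("a", "b", "c")
list1 <- list(c(1, 3, 2), c(4, 5, 8, 9), c(5, 2))

# Repeat each element of vec1 once per value in the matching list element,
# then flatten the list into a single vector for the second column.
out <- data.frame(
  col1 = rep(vec1, times = lengths(list1)),
  col2 = unlist(list1)
)
out
```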

R way to select unique columns from column name?

My issue is that I have a big data set of 283 columns, some of which have the same name (for example, "uncultured").
Is there a way to select columns avoiding those with repeated names? Those (bacteria) normally have a very small abundance, so I don't really care for their contribution, I'd just like to take the columns with unique names.
My database is something like
Samples col1 col2 col3 col4 col2 col1....
S1
S2
S3
...
and I'd like to select every column but the second col2 and col1.
Thanks!
Something like this should work:
df[, !duplicated(colnames(df))]
This way, for each repeated name you automatically keep only the first column with that name:
df[unique(colnames(df))]
#> col1 col2 col3 col4 S1 S2 S3
#> 1 1 2 3 4 7 8 9
#> 2 1 2 3 4 7 8 9
#> 3 1 2 3 4 7 8 9
#> 4 1 2 3 4 7 8 9
#> 5 1 2 3 4 7 8 9
Reproducible example
df is defined as:
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")
df
#> col1 col2 col3 col4 col2 col1 S1 S2 S3
#> 1 1 2 3 4 5 6 7 8 9
#> 2 1 2 3 4 5 6 7 8 9
#> 3 1 2 3 4 5 6 7 8 9
#> 4 1 2 3 4 5 6 7 8 9
#> 5 1 2 3 4 5 6 7 8 9
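Depending on what should happen to the repeated columns, a few variations are possible. The sketch below uses the same example data; the names keep_first, drop_all and renamed are illustrative only.

```r
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")
nm <- colnames(df)

# Keep the first column for each name (what the question asks for).
keep_first <- df[, !duplicated(nm), drop = FALSE]

# Drop every column whose name occurs more than once, first copy included.
drop_all <- df[, !(nm %in% nm[duplicated(nm)]), drop = FALSE]

# Keep all columns, but make the names unique instead of discarding data.
renamed <- df
colnames(renamed) <- make.unique(nm)

colnames(keep_first)
colnames(drop_all)
colnames(renamed)
```

The renamed variant is worth considering before dropping anything, because duplicated names make later selection by name ambiguous.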

Subsetting from a list of dataframes in R

I have a list of dataframes:
df1 <- data.frame(c(1:5), c(6:10))
df2 <- data.frame(c(1:7))
df3 <- data.frame(c(1:5), c("a", "b", "c", "d", "e"))
my_list <- list(df1, df2, df3)
From my_list, I want to extract the data frames which have only 2 columns (df1 and df3), and put them in a new list.
Maybe you can try lengths, since the length of a data frame is its number of columns:
> my_list[lengths(my_list) == 2]
[[1]]
c.1.5. c.6.10.
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
[[2]]
c.1.5. c..a....b....c....d....e..
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
It's also possible to subset using lapply and a logical condition (sapply will also work):
my_list[lapply(my_list, ncol) == 2]
[[1]]
c.1.5. c.6.10.
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
[[2]]
c.1.5. c..a....b....c....d....e..
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
We could use keep from purrr package with the condition:
library(purrr)
my_list %>% keep(~ ncol(.x) == 2)
[[1]]
c.1.5. c.6.10.
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
[[2]]
c.1.5. c..a....b....c....d....e..
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
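Two further base R variants, sketched on the data from the question. vapply() states the expected result type explicitly, and Filter() takes a predicate function directly. The names n_cols, two_col and two_col_alt are illustrative.

```r
df1 <- data.frame(c(1:5), c(6:10))
df2 <- data.frame(c(1:7))
df3 <- data.frame(c(1:5), c("a", "b", "c", "d", "e"))
my_list <- list(df1, df2, df3)

# vapply() returns a plain integer vector and raises an error if any element
# does not yield exactly one integer.
n_cols <- vapply(my_list, ncol, integer(1))
two_col <- my_list[n_cols == 2]

# Filter() expresses the same selection with a predicate function.
two_col_alt <- Filter(function(d) ncol(d) == 2, my_list)

length(two_col)
```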

Creating a new column based on other columns in a dataframe R

I have a dataframe that looks like this:
df <- data.frame('col1'=c(1,2,2,4,5), 'col2'=c(4,9,3,5,13), 'col3'=c(3,5,8,7,10))
> df
col1 col2 col3
1 1 4 3
2 2 9 5
3 2 3 8
4 4 5 7
5 5 13 10
I want to create a new column that has a value of 1 if at least one of the values in the row is greater than or equal to 8, and a value of 0 if all of the values in the row are less than 8. So the final result would look something like this:
> df
col1 col2 col3 new
1 1 4 3 0
2 2 9 5 1
3 2 3 8 1
4 4 5 7 0
5 5 13 10 1
Thanks!
This works:
df$new <- apply(df, 1, function(x) max(x >= 8))
df
# col1 col2 col3 new
# 1 1 4 3 0
# 2 2 9 5 1
# 3 2 3 8 1
# 4 4 5 7 0
# 5 5 13 10 1
Using rowSums:
df$new <- +(rowSums(df>=8, na.rm=TRUE) > 0); df
col1 col2 col3 new
1 1 4 3 0
2 2 9 5 1
3 2 3 8 1
4 4 5 7 0
5 5 13 10 1
Alternatively, using matrix multiplication:
df$new <- as.numeric(((df >= 8) %*% rep(1, ncol(df))) > 0)
df
col1 col2 col3 new
1 1 4 3 0
2 2 9 5 1
3 2 3 8 1
4 4 5 7 0
5 5 13 10 1
# Or logical column
df$new <- ((df >= 8) %*% rep(1, ncol(df))) > 0
df
col1 col2 col3 new
1 1 4 3 FALSE
2 2 9 5 TRUE
3 2 3 8 TRUE
4 4 5 7 FALSE
5 5 13 10 TRUE
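The answers above compare every column of df, which includes the new column once it exists and would also include any non-numeric columns. A hedged sketch that names the columns to test explicitly; cols and new_pmax are illustrative names not taken from the question:

```r
df <- data.frame(col1 = c(1, 2, 2, 4, 5), col2 = c(4, 9, 3, 5, 13), col3 = c(3, 5, 8, 7, 10))

# Columns to test; listing them keeps helper columns out of the comparison.
cols <- c("col1", "col2", "col3")

# 1 if any selected value in the row is >= 8, otherwise 0.
df$new <- as.integer(rowSums(df[cols] >= 8, na.rm = TRUE) > 0)

# Cross-check: the row-wise maximum reaches 8 exactly when some value does.
df$new_pmax <- as.integer(do.call(pmax, df[cols]) >= 8)
df
```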

I need column sums: sum12 is the sum of columns 1 and 2, and sum34 is the sum of columns 3 and 4

col1 col2 col3 col4
1 4 1 4
2 4 2 5
4 5 3 6
5 6 5 7
I need the column sums to look like this:
col1 col2 col3 col4 sum12 sum34
1 4 1 4 5 5
2 4 2 5 6 7
4 5 3 6 9 9
5 6 5 7 11 12
We can use transform
transform(df, sum12 = col1 + col2, sum34 = col3 + col4)
Or another option is
df[c("sum12", "sum34")] <- df[c(1,3)] + df[c(2,4)]
df
# col1 col2 col3 col4 sum12 sum34
#1 1 4 1 4 5 5
#2 2 4 2 5 6 7
#3 4 5 3 6 9 9
#4 5 6 5 7 11 12
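The question does not show how df is built, so the sketch below recreates it from the printed values (an assumption). It also shows a loop-based generalisation for arbitrary groups of columns; the groups list and the loop variable new_col are hypothetical names used for illustration.

```r
df <- data.frame(
  col1 = c(1, 2, 4, 5),
  col2 = c(4, 4, 5, 6),
  col3 = c(1, 2, 3, 5),
  col4 = c(4, 5, 6, 7)
)

# transform() returns a modified copy, so assign the result to keep it.
out <- transform(df, sum12 = col1 + col2, sum34 = col3 + col4)

# Generalisation: map each new column name to the columns it should sum.
groups <- list(sum12 = c("col1", "col2"), sum34 = c("col3", "col4"))
for (new_col in names(groups)) {
  # unname() drops the row names that rowSums() attaches to its result.
  df[[new_col]] <- unname(rowSums(df[groups[[new_col]]]))
}
df
```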
