How can I select columns based on two conditions? [duplicate] - r

I have a data frame with lots of columns. For example:
sample treatment col5 col6 col7
1 a 3 0 5
2 a 1 0 3
3 a 0 0 2
4 b 0 1 1
I want to select the sample and treatment columns plus all columns that meet the following 2 conditions:
Their value on the row in which treatment == 'b' is 0
Their value from at least one row where treatment == 'a' is not 0.
The expected result should look like this:
sample treatment col5
1 a 3
2 a 1
3 a 0
4 b 0
Example dataframe:
df <- structure(list(sample = 1:4, treatment = structure(c(1L, 1L,
1L, 2L), .Label = c("a", "b"), class = "factor"), col5 = c(3,
1, 0, 0), col6 = c(0, 0, 0, 1), col7 = c(5, 3, 2, 1)), class = "data.frame", row.names = c(NA,
-4L))

Here's a way in base R -
cs_a <- colSums(df[df$treatment == "a",-c(1:2)]) > 0
cs_b <- colSums(df[df$treatment == "b",-c(1:2)]) == 0
df[, c(TRUE, TRUE, cs_a & cs_b)]
sample treatment col5
1 1 a 3
2 2 a 1
3 3 a 0
4 4 b 0
With dplyr -
df %>%
select_at(which(c(TRUE, TRUE, cs_a & cs_b)))
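The colSums comparison in the base R answer works because the example values are non-negative; with negative values a column sum could be 0 even though individual entries are not. Here is a minimal sketch that tests the two conditions directly, assuming every column other than sample and treatment is a value column (value_cols, is_a, is_b and keep are illustrative names, not part of the original answer):

```r
# Example data from the question
df <- data.frame(
  sample = 1:4,
  treatment = factor(c("a", "a", "a", "b")),
  col5 = c(3, 1, 0, 0),
  col6 = c(0, 0, 0, 1),
  col7 = c(5, 3, 2, 1)
)

# Assumption: everything except the two id columns is a value column
value_cols <- setdiff(names(df), c("sample", "treatment"))
is_a <- df$treatment == "a"
is_b <- df$treatment == "b"

# Keep a column if it is 0 on every "b" row and non-zero on at least one "a" row
keep <- vapply(
  df[value_cols],
  function(v) all(v[is_b] == 0) && any(v[is_a] != 0),
  logical(1)
)

result <- df[c("sample", "treatment", value_cols[keep])]
result
```

For the example data this keeps sample, treatment and col5.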

Here is a much more verbose way in the tidyverse that does not require a manual colSums call for each level of treatment:
library(dplyr)
library(purrr)
library(tidyr)
sample <- 1:4
treatment <- c("a", "a", "a", "b")
col5 <- c(3,1,0,0)
col6 <- c(0,0,0,1)
col7 <- c(5,3,2,1)
dd <- data.frame(sample, treatment, col5, col6, col7)
# first create new columns that report whether the entries are zero
dd2 <- mutate_if(
.tbl = dd,
.predicate = is.numeric,
.funs = function(x)
x == 0
)
# then find the sum per column and per treatment group
# in R TRUE = 1 and FALSE = 0
number_of_zeros <- dd2 %>%
group_by(treatment) %>%
summarise_at(.vars = vars(col5:col7), .funs = "sum")
# then find the names of the columns you want to keep
keeper_columns <-
number_of_zeros %>%
select(-treatment) %>% # remove the treatment grouping variable
map_dfr( # check that every treatment level has at least one zero in the column
.x = .,
.f = function(x)
all(x > 0)
) %>%
gather(column, keeper) %>% # reformat
filter(keeper == TRUE) %>% # to grab the keepers
select(column) %>% # then select the column with column names
unlist %>% # and convert to character vector
unname
# subset the original dataset for the wanted columns
wanted_columns <- dd %>% select(1:2, all_of(keeper_columns))

Related

R- filter rows depending on value range across several columns

I have 5 columns with numerical data and I would like to filter for rows that match a data range in at least 3 of the 5 columns.
For example, I have the following data frame and I define a value range of 5-10.
My first row has 3 columns with values between 5 and 10, so I want to keep that row.
The second row only has 2 values between 5 and 10, so I want to remove it.
column1 column2 column3 column4 column5
7 4 10 9 2
4 8 2 6 2
First test whether the values in each column are greater than or equal to 5 and less than or equal to 10, then keep the rows where 3 or more values meet the condition.
dat[ rowSums( dat >= 5 & dat <= 10 ) >= 3, ]
column1 column2 column3 column4 column5
1 7 4 10 9 2
Data
dat <- structure(list(column1 = c(7L, 4L), column2 = c(4L, 8L), column3 = c(10L,
2L), column4 = c(9L, 6L), column5 = c(2, 2)), class = "data.frame", row.names = c(NA,
-2L))
I'd like to share a second approach:
# Setting up data
my_df <- tibble::tibble(A = c(7,4), B = c(4,8), C = c(10, 2), D = c(9,6), E = c(2,2), X = c("some", "character"))
my_min <- 5
my_max <- 10
Then do some tidyverse-magic:
# This is verbose, but shows clearly all the steps involved:
my_df_filtered <- my_df %>%
dplyr::mutate(n_cols_in_range = dplyr::across(where(is.numeric), ~ .x >= my_min & .x <= my_max)
) %>%
dplyr::rowwise() %>%
dplyr::mutate(n_cols_in_range = sum(n_cols_in_range, na.rm = TRUE)
) %>%
dplyr::filter(n_cols_in_range >= 3
) %>%
dplyr::select(-n_cols_in_range)
The above is equivalent to:
my_df_filtered <- my_df %>%
dplyr::rowwise() %>%
dplyr::filter(sum(dplyr::across(where(is.numeric), ~ .x >= my_min & .x <= my_max), na.rm = TRUE) >= 3)
That said, the first answer is clearly more elegant, since it needs only one line of code!
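If the data frame also holds non-numeric columns (like X in the second approach), the rowSums one-liner can be restricted to the numeric columns first. A minimal sketch, where keep_rows_in_range and its arguments are hypothetical names chosen for illustration:

```r
# Hypothetical helper: keep rows with at least `min_hits` numeric values
# inside the closed interval [lower, upper]
keep_rows_in_range <- function(data, lower, upper, min_hits) {
  # Only numeric columns take part in the range test
  num <- data[vapply(data, is.numeric, logical(1))]
  # Logical matrix of in-range cells; NA cells are not counted as hits
  hits <- rowSums(num >= lower & num <= upper, na.rm = TRUE)
  data[hits >= min_hits, , drop = FALSE]
}

my_df <- data.frame(
  A = c(7, 4), B = c(4, 8), C = c(10, 2), D = c(9, 6), E = c(2, 2),
  X = c("some", "character")
)

res <- keep_rows_in_range(my_df, lower = 5, upper = 10, min_hits = 3)
res
```

Only the first row is returned, because it has three values between 5 and 10 while the second row has two.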

R: count times per column a condition is met and row names appear in a list

I have a dataframe with count information (df1)
rownames sample1 sample2 sample3
m1 0 5 1
m2 1 7 5
m3 6 2 0
m4 3 1 0
and a second with sample information (df2)
rownames  batch  total count
sample1   a      10
sample2   b      15
sample3   a      6
I also have two lists with information about the m values (could easily be turned into another data frame if necessary but I would rather not add to the count information as it is quite large). No patterns (such as even and odd) exist, I am just using a very simplistic example
x <- c("m1", "m3") and y <- c("m2", "m4")
What I would like to do is add another two columns to the sample information: for each sample, a count of the m values that are 5 or more and appear in list x or y
rownames  batch  total count  x  y
sample1   a      10           1  0
sample2   b      15           1  1
sample3   a      6            0  1
My current strategy is to make a list of values for both x and y and then append them to df2. Here are my attempts so far:
numX <- colSums(df1[sum(rownames(df1)>10 %in% x),]) and numX <- colSums(df1[sum(rownames(df1)>10 %in% x),]) both return a list of 0s
numX <- colSums(df1[rownames(df1)>10 %in% x,]) returns a list of the sum of count values meeting the conditions for each column
numX <- length(df1[rownames(df1)>10 %in% novel,]) returns the number of times the condition is met (in this example 2L)
I am not really sure how to approach this so I have just been throwing around attempts. I've tried looking for answers but maybe I am just struggling to find the proper wording.
We may do this with rowwise
library(dplyr)
df2 %>%
rowwise %>%
mutate(x = sum(df1[[rownames]][df1$rownames %in% x] >= 5), # number of x values >= 5
y = sum(df1[[rownames]][df1$rownames %in% y] >= 5)) %>% # number of y values >= 5
ungroup
-output
# A tibble: 3 × 5
rownames batch totalcount x y
<chr> <chr> <int> <int> <int>
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
Or based on the data, a base R option would be
# count the values >= 5 within each group (x or y) for every sample
out <- aggregate(. ~ grp, FUN = function(v) sum(v >= 5),
transform(df1, grp = c('x', 'y')[1 + (rownames %in% y)] )[-1])
df2[out$grp] <- t(out[-1])
-output
> df2
rownames batch totalcount x y
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
data
df1 <- structure(list(rownames = c("m1", "m2", "m3", "m4"), sample1 = c(0L,
1L, 6L, 3L), sample2 = c(5L, 7L, 2L, 1L), sample3 = c(1L, 5L,
0L, 0L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(rownames = c("sample1", "sample2", "sample3"),
batch = c("a", "b", "a"), totalcount = c(10L, 15L, 6L)),
class = "data.frame", row.names = c(NA,
-3L))
How about using dplyr and reshape2::melt?
library(dplyr)
library(reshape2)
df3 <- df1 %>%
melt(id.vars = "rownames") %>%
filter(value >= 5) %>%
mutate(x = as.numeric(rownames %in% c("m1", "m3")),
y = as.numeric(rownames %in% c("m2", "m4"))) %>%
select(-rownames, - value) %>%
group_by(variable) %>%
summarise(x = sum(x), y = sum(y))
df2 %>% left_join(df3, by = c("rownames" = "variable"))
rownames batch total_count x y
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
You can create a named list of vectors and, for each value in rownames, count how many of the x and y values in the respective sample are >= 5.
Base R option -
list_vec <- list(x = x, y = y)
cbind(df2, do.call(rbind, lapply(df2$rownames, function(x)
sapply(list_vec, function(y) {
sum(df1[[x]][df1$rownames %in% y] >= 5)
}))))
# rownames batch total.count x y
#1 sample1 a 10 1 0
#2 sample2 b 15 1 1
#3 sample3 a 6 0 1
Using tidyverse -
library(dplyr)
library(purrr)
list_vec <- lst(x, y)
df2 %>%
bind_cols(map_df(df2$rownames, function(x)
map(list_vec, ~sum(df1[[x]][df1$rownames %in% .x] >= 5))))
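Because the condition is the same for every sample, the counts can also be computed for all samples at once with colSums on a logical matrix. This is only a sketch under the assumption that the sample names in df2$rownames match the column names of df1; groups, counts and result are illustrative names:

```r
df1 <- data.frame(
  rownames = c("m1", "m2", "m3", "m4"),
  sample1 = c(0L, 1L, 6L, 3L),
  sample2 = c(5L, 7L, 2L, 1L),
  sample3 = c(1L, 5L, 0L, 0L)
)
df2 <- data.frame(
  rownames = c("sample1", "sample2", "sample3"),
  batch = c("a", "b", "a"),
  totalcount = c(10L, 15L, 6L)
)
x <- c("m1", "m3")
y <- c("m2", "m4")

groups <- list(x = x, y = y)

# For each group, count per sample column how many of its rows are >= 5
counts <- sapply(groups, function(ids) {
  colSums(df1[df1$rownames %in% ids, df2$rownames, drop = FALSE] >= 5)
})

# Rows of `counts` are named by sample, so index them in df2's order
result <- cbind(df2, counts[df2$rownames, , drop = FALSE])
result
```

The count matrix has one row per sample and one column per group, so adding a third list only means adding an element to groups.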

Join of column values for specific row values

I'd like to join (left_join) a tibble (df2) to another one (df1) only where the value of col2 in df1 is NA. I am currently using code that is not very elegant. Any advice on how to shorten the code would be greatly appreciated!
library(tidyverse)
# df1 contains NAs that need to be replaced by values from df2, for relevant col1 values
df1 <- tibble(col1 = c("a", "b", "c", "d"), col2 = c(1, 2, NA, NA), col3 = c(10, 20, 30, 40))
df2 <- tibble(col1 = c("a", "b", "c", "d"), col2 = c(5, 6, 7, 8), col3 = c(50, 60, 70, 80))
# my current approach
df3 <- df1 %>%
filter(!is.na(col2))
df4 <- df1 %>%
filter(is.na(col2)) %>%
select(col1)%>%
left_join(df2)
# output tibble that is expected
df_final <- df3 %>%
bind_rows(df4)
Here's a small dplyr answer that works for me, although it might get slow if you have tons of rows:
df1 %>%
filter(is.na(col2)) %>%
select(col1) %>%
left_join(df2, by = "col1") %>%
bind_rows(df1, .) %>%
filter(!is.na(col2))
We can use data.table methods
library(data.table)
setDT(df1)[setDT(df2), col2 := fcoalesce(col2, i.col2), on = .(col1)]
-output
> df1
col1 col2 col3
1: a 1 10
2: b 2 20
3: c 7 30
4: d 8 40
Or an option with tidyverse
library(dplyr)
library(stringr)
df1 %>%
left_join(df2, by = c("col1")) %>%
transmute(col1, across(ends_with(".x"),
~ coalesce(., get(str_replace(cur_column(), ".x", ".y"))),
.names = "{str_remove(.col, '.x')}"))
-output
# A tibble: 4 x 3
col1 col2 col3
<chr> <dbl> <dbl>
1 a 1 10
2 b 2 20
3 c 7 30
4 d 8 40
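The same fill-in can be sketched in base R with match, without a join. Like the coalesce-based answers, this fills only the missing col2 values and leaves col3 untouched, whereas the question's own df_final also takes col3 from df2 for those rows. The names patched, idx and fill are illustrative, and plain data frames stand in for the tibbles:

```r
df1 <- data.frame(
  col1 = c("a", "b", "c", "d"),
  col2 = c(1, 2, NA, NA),
  col3 = c(10, 20, 30, 40)
)
df2 <- data.frame(
  col1 = c("a", "b", "c", "d"),
  col2 = c(5, 6, 7, 8),
  col3 = c(50, 60, 70, 80)
)

patched <- df1
idx <- match(patched$col1, df2$col1)  # row of df2 that holds each key of df1
fill <- is.na(patched$col2)           # rows of df1 that need a value

# Copy col2 from the matching df2 row only where df1 has NA
patched$col2[fill] <- df2$col2[idx[fill]]
patched
```

To replace the whole row instead, the same idx and fill vectors could be used to assign every non-key column.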

Data frames in R: Calculating average of rows in a data frame while ignoring entries with '0' values

Let's say in the R environment, I have this data frame with n rows:
a b c classes
1 2 0 a
0 0 2 b
0 1 0 c
The result that I am looking for is the number of non-zero values in each row and the average of those values:
size_of_a = 2
average_of_a = 1.5
size_of_b = 1
average_of_b = 2
and the same for the other rows.
I have tried rowSums(dt[-c(4)] != 0) for finding the non-zero elements, but I can't be sure that the 'classes' column will be the 4th column.
I would appreciate your help with acquiring these results.
Thanks
First, I create the data frame.
df <- read.table(text = "a b c classes
1 2 0 a
0 0 2 b
0 1 0 c", header = TRUE)
Then, I replace zeros with NAs to make life easier, since functions often have na.rm to ignore them.
df[df==0] <- NA
Finally, I bind together the sum of non-zero elements, the mean values, and the class names into a data frame.
data.frame(classes = df[,4],
size = rowSums(df[, -4]>0, na.rm = TRUE),
mean = rowMeans(df[, -4], na.rm = TRUE))
which gives,
# classes size mean
# 1 a 2 1.5
# 2 b 1 2.0
# 3 c 1 1.0
Edit: to avoid relying on the position of the classes column, select it by name:
data.frame(classes = df[,"classes"],
size = rowSums(df[, names(df) != "classes"]>0, na.rm = TRUE),
mean = rowMeans(df[, names(df) != "classes"], na.rm = TRUE))
# classes size mean
# 1 a 2 1.5
# 2 b 1 2.0
# 3 c 1 1.0
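A variant that leaves the zeros in place and picks the numeric columns by type rather than by name or position could look like the sketch below. It assumes classes is the only non-numeric column; num, size and res are illustrative names:

```r
df <- data.frame(
  a = c(1, 0, 0),
  b = c(2, 0, 1),
  c = c(0, 2, 0),
  classes = c("a", "b", "c")
)

# Numeric columns only, wherever the classes column happens to sit
num <- as.matrix(df[vapply(df, is.numeric, logical(1))])

size <- rowSums(num != 0)  # number of non-zero entries per row

res <- data.frame(
  classes = df$classes,
  size = unname(size),
  mean = unname(rowSums(num) / size)  # NaN for a row that is all zeros
)
res
```

Zeros add nothing to the row sum, so dividing the sum by the non-zero count gives the mean of the non-zero entries.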
You can do it with
# Generate some fake data
set.seed(1)
n = 10
k = 5
x = matrix(runif(n * k), n, k)
x[x < 0.5] = 0
# Get number of nonzero entries in each row
nonzeros = apply(x, 1, function(z) sum(z != 0))
# Take row sums and divide by number of non-zero entries
rowSums(x) / nonzeros
Or, using the data.frame you provided, it would look like this
# The data
x = structure(list(a = c(1L, 0L, 0L), b = c(2L, 0L, 1L), c = c(0L,
2L, 0L), classes = structure(1:3, .Label = c("a", "b", "c"), class = "factor")), .Names = c("a",
"b", "c", "classes"), class = "data.frame", row.names = c(NA,
-3L))
column = which(names(x) == "classes")
nonzeros = apply(x[-column], 1, function(z) sum(z != 0))
rowSums(x[-column]) / nonzeros
Another way to create the data frame is the tibble function, which is available through the dplyr library:
library(dplyr)
df <-
tibble(
a = c(1,0,0),
b = c(2,0,1),
c = c(0,2,0),
classes = c("a", "b", "c")
)
To count the elements in a row that are equal to zero, you can evaluate the whole row even when column classes is not numeric
rowSums( df == 0 )
Conversely, the number of elements different from zero in the whole row can be calculated through rowSums( df != 0 ).
Therefore, the average you are looking for is:
rowSums( df[ , 1:3] )/rowSums( df[ ,1:3] != 0 )
Cheers!

Looping through columns and duplicating data in R

I am trying to iterate through columns; if a column is a whole year, it should be duplicated four times and renamed to quarters.
So this:
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to find column names made of 4 digits, then take the last two digits from those names:
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(key,value) %>% group_by(key) %>%
mutate(key_new = ifelse(str_detect(key,'\\d{4}'),paste0('Q',1:4,'-',str_extract(key,'\\d{2}$'),collapse = ','),key)) %>%
ungroup() %>% select(-key) %>%
separate_rows(key_new,sep = ',') %>% spread(key_new,value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
if (i %in% col.ids)
return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
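The re-indexing approach produces names like 00.Q1, while the question asks for Q1-00. A minimal sketch of the same idea with the naming from the question, assuming the data frame was built with check.names = FALSE so that names such as 2000 and Q1-01 survive (is_year, new_names and out are illustrative names):

```r
df <- data.frame(`2000` = 1L, `Q1-01` = 2L, `Q2-01` = 3L, `Q3-01` = 3L,
                 check.names = FALSE)

# Columns whose name is exactly four digits are treated as whole years
is_year <- grepl("^[0-9]{4}$", names(df))

# A year column expands to four quarter names, other names stay as they are
new_names <- unlist(lapply(names(df), function(nm) {
  if (grepl("^[0-9]{4}$", nm)) {
    paste0("Q", 1:4, "-", substr(nm, 3, 4))
  } else {
    nm
  }
}))

# Repeat year columns four times, keep the rest once, then rename
out <- df[rep(seq_along(df), ifelse(is_year, 4L, 1L))]
names(out) <- new_names
out
```

For the example row this gives Q1-00 to Q4-00 (all 1), followed by the original Q1-01, Q2-01 and Q3-01 columns.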
