I would like to evaluate conditions within groups. While mtcars may not exactly match my data, here is my problem.
Let's group mtcars by gear. Then I would like to get the subset of data where, within the gear group, there is a row where 'carb' equals 1 and another where it equals 4. I want all the rows of a group if it contains both a 1 and a 4, and I would like to omit all the rows within the group if it doesn't.
p <- arrange(mtcars, gear)
p <- filter(mtcars, carb == 1 & carb == 4)
This obviously gives 0 rows, since there is not a single row where carb has two values :)
The preferred outcome would be all the rows of mtcars where gear is 3 or 4, omitting the gear = 5 rows, since within the gear 5 group there isn't a row with carb == 1.
You can do:
mtcars %>%
group_by(gear) %>%
filter(any(carb == 1) & any(carb == 4))
Or:
mtcars %>%
group_by(gear) %>%
filter(all(c(1, 4) %in% carb))
An option with data.table
library(data.table)
as.data.table(mtcars)[, .SD[sum(c(1, 4) %in% carb) == 2], by = gear]
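As a cross-check without extra packages, the same group-wise test can be sketched in base R with ave(). The names has_both and kept are illustrative only and do not come from the answers above.

```r
# For every row, flag whether its gear group contains both carb == 1 and carb == 4.
# ave() spreads the single TRUE/FALSE per group over that group's rows (stored as 1/0).
has_both <- ave(mtcars$carb, mtcars$gear,
                FUN = function(x) all(c(1, 4) %in% x)) == 1

kept <- mtcars[has_both, ]

# Tabulate the surviving gear groups; gear 5 should be absent
table(kept$gear)
```

This keeps the original row order and row names, which can make it easier to compare against the dplyr result.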
Related
I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
slice(6)
mpg <- mtcars %>%
filter(mpg > reference$mpg) %>%
count() %>%
pull()
cyl <- mtcars %>%
filter(cyl > reference$cyl) %>%
count() %>%
pull()
tibble(mpg, cyl)
Except, suppose I need to do this for something like 100 variables, so there must be a better way than repeating the pipe each time.
How could the code above be rewritten more compactly (maybe using map() or anything else that works nicely with pipes) so that the result is a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy, but I'm stuck.
Thank you!
Or:
library(tidyverse)
map_dfc(mtcars, ~sum(.x[6] < .x))
map2_dfc(mtcars, reference, ~sum(.y < .x))
You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
summarise(across(everything(), ~ sum(. > .[6])))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18 14 15 22 30 11 1 0 13 17 25
base solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 18 14 15 22 30 11 1 0 13 17 25
You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the column names to loop over
loop_var <- colnames(reference)
for(i in loop_var){
nr <- mtcars %>%
filter(mtcars[, i] > reference[, i]) %>%
count() %>%
pull()
# Save every iteration in the loop as the ith element of the list
list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)
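The loop above can also be collapsed into a single vapply() call. This is only a sketch of the same counting idea in base R; the names counts and result are made up for the example.

```r
# Row 6 of mtcars is the Valiant, used here as the reference car.
# For each column, count how many cars have a strictly larger value.
counts <- vapply(mtcars, function(col) sum(col > col[6]), integer(1))

# Named integer vector, one entry per variable
counts[c("mpg", "cyl")]

# One-row data frame with one column per variable, similar to the tibble in the question
result <- as.data.frame(as.list(counts))
```

vapply() checks that every column yields exactly one integer, so a column that cannot be compared fails loudly instead of silently producing a list.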
I'm a former SAS user and am trying to delete cases based on multiple conditions using the "filter" function in the tidyverse, and I am running into problems with the intersection of multiple conditions. In my reprex, I'm trying to delete cars (mtcars dataset) that have BOTH 8 cylinders AND a qsec of 18 or more. In the mtcars dataset there are only 2 cars that meet both of these conditions (Merc 450SLC and Cadillac Fleetwood). There are 32 observations in the mtcars dataset, so the solution should return 30 observations based on the criteria.
I've tried using filter(cyl != 8, qsec < 18), but that deletes all 8 cylinder cars and all cars with qsecs 18 and above (resulting in only 5 observations). Using "&" yields the same result. Lots of googling didn't result in a solution, so any help is appreciated.
#Reprex:
filterdata <- mtcars %>%
filter(qsec < 18 & cyl != 8)
Thanks,
Wythe
This is what you want to remove :
library(dplyr)
mtcars %>% filter(cyl == 8 & qsec >= 18)
So you can negate the condition to get the rows that you want to keep.
mtcars %>% filter(!(cyl == 8 & qsec >= 18))
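The approach you were trying needs an OR (|) condition instead: by De Morgan's law, !(A & B) is equivalent to !A | !B.
mtcars %>% filter(cyl != 8 | qsec < 18)
PS - There is only one row that has both cyl == 8 and qsec >= 18 (Merc 450SLC), so the result has 31 rows rather than 30. Cadillac Fleetwood has a qsec of 17.98.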
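That the negated filter and the OR filter select the same rows can be checked directly in base R. A minimal sketch, where is_v8 and is_slow are illustrative names:

```r
is_v8   <- mtcars$cyl == 8
is_slow <- mtcars$qsec >= 18

# Negating the conjunction gives the same mask as OR-ing the negations
stopifnot(identical(!(is_v8 & is_slow), !is_v8 | !is_slow))

# The rows that satisfy both conditions and therefore get dropped
rownames(mtcars)[is_v8 & is_slow]

# Everything else is kept
kept <- mtcars[!(is_v8 & is_slow), ]
nrow(kept)
```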
In base R, we can use subset
subset(mtcars, !(cyl == 8 & qsec >= 18))
For example, in the mtcars dataset I want to get the car with the max hp among those with 6 cyl and 4 gears. I run the code below and get the value, but the name of the car isn't showing up. How do I change that?
> mtcars %>%
+ filter(cyl=="6" & gear=="4") %>%
+ select(hp)
> v=mtcars %>%
+ filter(cyl=="6" & gear=="4") %>%
+ select(hp) %>%
+ summarise(max(hp))
This is the result
> v
max(hp)
1 123
Using base R functions with the help of magrittr's pipe operator for readability.
library(magrittr)
mtcars %>%
subset(cyl == 6 & gear == 4) %>%
subset(hp == max(hp)) %>%
rownames()
#[1] "Merc 280" "Merc 280C"
Note that the max hp value is the same for the two cars above, so both car names are returned.
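We can create a column from the row names, since the tidyverse would otherwise strip off the row names attribute. Note that rownames_to_column() comes from tibble, and that slice(which.max(hp)) keeps only the first of the tied rows.
library(dplyr)
library(tibble)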
mtcars %>%
rownames_to_column('rn') %>%
filter(cyl=="6" & gear=="4") %>%
select(rn, hp) %>%
slice(which.max(hp))
In this particular case, it is the summarise step that removes the row names. According to ?summarise:
Data frame attributes are not preserved, because summarise() fundamentally creates a new data frame.
I want to create a new column (T/F) based on any value from a list being present in multiple columns. For this example, I'm using mtcars and searching for two values in two columns, but my actual challenge is many values in many columns.
I have a successful filter using filter_at() included below, but I've been unable to apply that logic to a mutate:
# there are 7 cars with 6 cyl
mtcars %>%
filter(cyl == 6)
# there are 2 cars with 19.2 mpg, one with 6 cyl, one with 8
mtcars %>%
filter(mpg == 19.2)
# there are 8 rows with either.
# these are the rows I want as TRUE
mtcars %>%
filter(mpg == 19.2 | cyl == 6)
# set the cols to look at
mtcars_cols <- mtcars %>%
select(matches('^(mp|cy)')) %>% names()
# set the values to look at
mtcars_numbs <- c(19.2, 6)
# result is 8 rows with either value in either col.
# this is a successful filter of the data
out1 <- mtcars %>%
filter_at(vars(mtcars_cols), any_vars(
. %in% mtcars_numbs
)
)
# shows set with all 6 cyl, plus one 8cyl 19.2 mpg
out1 %>%
select(mpg, cyl)
# This attempts to apply the filter list to the cols,
# but I only get 6 rows as True
# I tried to change == to %in% but that results in an error
out2 <- mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs) > 0
)
# only 6 rows returned
out2 %>%
filter(myset == T)
I'm not sure why the two rows are skipped. I think it might be the use of rowSums that is aggregating those two rows in some way.
If we want to do the corresponding checks, it may be better to use map2
library(dplyr)
library(purrr)
map2_df(mtcars_cols, mtcars_numbs, ~
mtcars %>%
filter(!! rlang::sym(.x) == .y)) %>%
distinct
NOTE: Comparing floating point numbers with == can cause trouble, because two values that print identically may differ in their last bits and compare as FALSE.
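The floating point caveat is easy to reproduce in base R. A small sketch, where the tolerance 1e-8 is an arbitrary choice for illustration:

```r
x <- 0.1 + 0.2

# Exact comparison fails because neither side is exactly representable in binary
exact <- x == 0.3

# Tolerance-based alternatives
close_enough <- isTRUE(all.equal(x, 0.3))
within_tol   <- abs(x - 0.3) < 1e-8

# The same idea applied to the mpg column
n_mpg_hits <- sum(abs(mtcars$mpg - 19.2) < 1e-8)
```

With mtcars the exact comparison happens to work, because 19.2 in the data and 19.2 in the code are parsed from the same literal. Values that come out of arithmetic are where a tolerance matters.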
Also, note that == only behaves as expected when the lhs and rhs have the same length or the rhs is of length 1 (in which case it is recycled). If the rhs is longer than 1 and shorter than the lhs, its values are recycled down the columns in column-major order, so they alternate from row to row instead of being matched one value per column.
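To make the recycling visible, here is a rough comparison of the naive test against a per-column test. It uses sweep() only as one convenient way to pair each column with its value; cols, vals, naive and per_col are illustrative names.

```r
cols <- c("mpg", "cyl")
vals <- c(19.2, 6)

# Naive: vals is recycled down the rows, so odd rows are compared
# against 19.2 and even rows against 6, in both columns
naive <- rowSums(mtcars[cols] == vals) > 0

# Per column: the first column is compared with vals[1], the second with vals[2]
per_col <- rowSums(sweep(as.matrix(mtcars[cols]), 2, vals, "==")) > 0

sum(naive)
sum(per_col)

# Rows that the naive version misses
rownames(mtcars)[per_col & !naive]
```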
We can replicate to make the lengths equal and now it should work
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs[col(select(., mtcars_cols))]) > 0
) %>% pull(myset) %>% sum
#[1] 8
In the above code select is used twice for better understanding. Otherwise, we can also use rep
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == rep(mtcars_numbs, each = n())) > 0
) %>%
pull(myset) %>%
sum
#[1] 8
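If the goal is the %in% behaviour of the original filter_at() call (any of the values in any of the columns) rather than one value per column, a base R sketch could look like this. The names cols, vals, hits and myset are assumptions for the example.

```r
cols <- c("mpg", "cyl")
vals <- c(19.2, 6)

# One logical vector per column: is the value one of vals?
hits <- lapply(mtcars[cols], function(x) x %in% vals)

# OR the per-column vectors together, row by row
myset <- Reduce(`|`, hits)

sum(myset)
```

The resulting vector has one entry per row, so it can be assigned as a new column, for example inside mutate(). This scales to many columns and many values because only cols and vals change.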
I'd like to apply a transformation to all columns via dplyr::mutate_each, e.g.
library(dplyr)
mult <- function(x,m) return(x*m)
mtcars %>% mutate_each(funs(mult(.,2))) # Multiply all columns by a factor of two
However, the transformation should have parameters depending on the column name. Therefore, the column name should be passed to the function as an additional argument
named.mult <- function(x,colname) return(x*param.A[[colname]])
Example: multiply every column by a different factor:
param.A <- c()
param.A[names(mtcars)] <- seq(length(names(mtcars)))
param.A
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 2 3 4 5 6 7 8 9 10 11
Since the column name gets lost during mutate_each, I currently work around this by passing a list with lazy evaluation to mutate_ (the SE version):
library(lazyeval)
named.mutate <- function(fun, cols) sapply(cols, function(n) interp(~fun(col, n), fun=fun, col=as.name(n)))
mtcars %>% mutate_(.dots=named.mutate(named.mult, names(.)))
Works, but is there some special variable like .name which contains the column name of . for each colwise execution? So I could do something like
mtcars %>% mutate_each(funs(named.mult(.,.name)))
I'd suggest taking a different approach. Instead of using mutate_each, a combination of dplyr::mutate with tidyr::gather and tidyr::spread can achieve the same result.
For example:
library(dplyr)
library(tidyr)
data(mtcars)
# Multiply each column by a different integer
mtcars %>%
dplyr::tbl_df() %>%
dplyr::mutate(make_and_model = rownames(mtcars)) %>%
tidyr::gather(key, value, -make_and_model) %>%
dplyr::mutate(m = as.integer(factor(key)), # a multiplication factor dependent on column name
value = value * m) %>%
dplyr::select(-m) %>%
tidyr::spread(key, value)
# compare to the original data
mtcars[order(rownames(mtcars)), order(names(mtcars))]
# the multiplicative values used.
mtcars %>%
tidyr::gather() %>%
dplyr::mutate(m = as.integer(factor(key))) %>%
dplyr::select(-value) %>%
dplyr::distinct()
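For the original question of handing the column name to the function, a wide-format sketch may be simpler than reshaping. The base R part below uses Map() to walk the columns and their names in parallel. The guarded dplyr part assumes dplyr 1.0.0 or later, where cur_column() is available inside across(); that is newer than the mutate_each() API used in the question. The names scaled and scaled_dplyr are illustrative.

```r
# Per-column factors keyed by column name (1 for mpg, 2 for cyl, ... as in the question)
param.A <- setNames(seq_along(mtcars), names(mtcars))

# Base R: Map() passes each column together with its name
scaled <- mtcars
scaled[] <- Map(function(x, nm) x * param.A[[nm]], mtcars, names(mtcars))

# dplyr variant, run only if the package is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  scaled_dplyr <- dplyr::mutate(
    mtcars,
    dplyr::across(dplyr::everything(), ~ .x * param.A[[dplyr::cur_column()]])
  )
}
```

Both variants keep the data in its original shape and column order, so the result can be compared with mtcars directly.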