R multiple condition filtering/deleting

I'm a former SAS user trying to delete cases based on multiple conditions using the "filter" function in the tidyverse, and I'm running into problems with the intersection of multiple conditions. In my reprex, I'm trying to delete cars (mtcars dataset) that have BOTH 8 cylinders AND a qsec of 18 or more. In the mtcars dataset there are only 2 cars that meet both of these conditions (Merc 450SLC and Cadillac Fleetwood). There are 32 observations in the mtcars dataset, so the solution should return 30 observations based on the criteria.
I've tried using filter(cyl != 8, qsec < 18), but that deletes all 8 cylinder cars and all cars with qsecs 18 and above (resulting in only 5 observations). Using "&" yields the same result. Lots of googling didn't result in a solution, so any help is appreciated.
#Reprex:
library(dplyr)
filterdata <- mtcars %>%
  filter(qsec < 18 & cyl != 8)
Thanks,
Wythe

These are the rows you want to remove:
library(dplyr)
mtcars %>% filter(cyl == 8 & qsec >= 18)
So you can negate the condition to get the rows that you want to keep.
mtcars %>% filter(!(cyl == 8 & qsec >= 18))
The approach you were trying needs an OR (|) condition instead of AND. By De Morgan's law, !(A & B) is the same as !A | !B.
mtcars %>% filter(cyl != 8 | qsec < 18)
PS - Only one 8-cylinder row has qsec >= 18 (Merc 450SLC, with qsec 18.00). Cadillac Fleetwood has a qsec of 17.98, so the result has 31 rows rather than 30.
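To check that the negated AND form and the OR form select the same rows, the conditions can be compared as plain logical vectors. This is a minimal base R sketch; the names to_drop, keep_not and keep_or are only illustrative.

```r
# Rows to delete: 8 cylinders AND qsec of 18 or more
to_drop <- mtcars$cyl == 8 & mtcars$qsec >= 18

# Two equivalent ways to say "keep everything else" (De Morgan's law)
keep_not <- !(mtcars$cyl == 8 & mtcars$qsec >= 18)
keep_or  <- mtcars$cyl != 8 | mtcars$qsec < 18

identical(keep_not, keep_or)   # TRUE
rownames(mtcars)[to_drop]      # "Merc 450SLC"
nrow(mtcars[keep_or, ])        # 31
```

The original attempt, cyl != 8 & qsec < 18, keeps only rows that pass both tests, which is why it dropped far more rows than intended.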

In base R, we can use subset
subset(mtcars, !(cyl == 8 & qsec >= 18))

Related

Mapping pipes to multiple columns in tidyverse

I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
  slice(6)
mpg <- mtcars %>%
  filter(mpg > reference$mpg) %>%
  count() %>%
  pull()
cyl <- mtcars %>%
  filter(cyl > reference$cyl) %>%
  count() %>%
  pull()
tibble(mpg, cyl)
Except, suppose I need to do it for something like 100 variables, so there must be a better way to repeat the process.
How could I rewrite the code above more compactly (maybe using map() or anything else that works nicely with pipes) so that the result is a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy, but I'm stuck.
Thank you!
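One option is purrr's map_dfc(), comparing each column against its 6th element (Valiant). The second line does the same with reference from the question: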
library(tidyverse)
map_dfc(mtcars, ~sum(.x[6] < .x))
map2_dfc(mtcars, reference, ~sum(.y < .x))
You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
  summarise(across(everything(), ~ sum(. > .[6])))
#   mpg cyl disp hp drat wt qsec vs am gear carb
# 1  18  14   15 22   30 11    1  0 13   17   25
base solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
#  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
#   18   14   15   22   30   11    1    0   13   17   25
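Another base R variant, shown here only as a sketch, applies the comparison column by column with vapply(). The names ref_row and counts are illustrative and not part of the answers above.

```r
# Row 6 of mtcars is Valiant, the reference car
ref_row <- 6

# For each column, count the rows whose value exceeds the reference value
counts <- vapply(mtcars, function(col) sum(col > col[ref_row]), integer(1))

counts[c("mpg", "cyl")]   # mpg 18, cyl 14
```

vapply() returns a named integer vector, so the counts stay labelled by column without any reshaping.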
You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the columnnames to loop over
loop_var <- colnames(reference)
for (i in loop_var) {
  nr <- mtcars %>%
    filter(mtcars[, i] > reference[, i]) %>%
    count() %>%
    pull()
  # Save every iteration in the loop as the ith element of the list
  list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)

Sort each column in a data frame and get common in the top 50 percentage

I have a file format like this
data<-mtcars[,c("mpg","drat","wt")]
data<- tibble::rownames_to_column(data, "Names")
I need to sort all the columns (the code below sorts in descending order) and keep the top 50% of each.
Then, from this data, I need the rows that are common to all columns.
For this purpose, I am using the following code:
mpg<-data %>% dplyr::select(Names,mpg)%>%mutate_if(is.numeric, round,digits=3) %>% arrange(desc(mpg))%>% filter(row_number() / n() <= .5)
drat<-data %>% dplyr::select(Names,drat)%>%mutate_if(is.numeric, round,digits=3) %>% arrange(desc(drat))%>% filter(row_number() / n() <= .5)
wt<-data %>% dplyr::select(Names,wt)%>%mutate_if(is.numeric, round,digits=3) %>% arrange(desc(wt))%>% filter(row_number() / n() <= .5)
The code above sorts each column and keeps its top 50 percent.
plyr::join_all(list(mpg,drat,wt),by = 'Names', type = 'inner')
This prints the rows that are in the top 50 percent of all three columns.
Is there a way to do this in a single line with some package or function?
Sure you can. Here's a single-line dplyr solution:
library(dplyr)
filter(data, if_all(c(mpg, drat, wt), ~rank(-., ties.method = "first")/n() <= 0.5))
#>      Names  mpg drat   wt
#> 1 Merc 280 19.2 3.92 3.44
Simply use
data %>% filter(mpg>=median(mpg) & drat>=median(drat) & wt >= median(wt))
#      Names  mpg drat   wt
# 1 Merc 280 19.2 3.92 3.44
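To make the "top half of every column" logic explicit, here is a base R sketch that builds the top-half name set per column and intersects them. The helper top_half and the variables cols, half and common are hypothetical names used only for illustration; the car names are read from the row names of mtcars rather than from the Names column.

```r
cols <- c("mpg", "drat", "wt")
half <- nrow(mtcars) / 2

# Names of the cars in the top half of one column (largest values first)
top_half <- function(col) {
  rownames(mtcars)[rank(-mtcars[[col]], ties.method = "first") <= half]
}

# Cars that appear in the top half of every column
common <- Reduce(intersect, lapply(cols, top_half))
common   # "Merc 280"
```

The rank-based version always keeps exactly half of the rows per column. The median comparison can keep more than half when several rows tie at the median, so the two approaches may differ on other data.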

In dplyr when summarising after filtering the row name is not showing up

For example, in the mtcars dataset I want to get the car with the max hp that has 6 cylinders and 4 gears. I run the code below and get the value, but the name of the car isn't showing up. How do I change that?
library(dplyr)
v <- mtcars %>%
  filter(cyl == "6" & gear == "4") %>%
  select(hp) %>%
  summarise(max(hp))
This is the result:
> v
  max(hp)
1     123
You can use base R functions, with magrittr's pipe operator for readability.
library(magrittr)
mtcars %>%
  subset(cyl == 6 & gear == 4) %>%
  subset(hp == max(hp)) %>%
  rownames()
#[1] "Merc 280" "Merc 280C"
Note that the max hp value is the same for these two cars, so both car names are returned.
We can create a column from the row names, as tidyverse verbs tend to strip off the rownames attribute. Note that rownames_to_column() comes from tibble, and that slice(which.max(hp)) keeps only the first row when there are ties.
library(dplyr)
mtcars %>%
  tibble::rownames_to_column('rn') %>%
  filter(cyl == 6 & gear == 4) %>%
  select(rn, hp) %>%
  slice(which.max(hp))
In this particular case, it is the summarise step that removes the rownames. According to ?summarise:
Data frame attributes are not preserved, because summarise() fundamentally creates a new data frame.
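If every car tied for the maximum should be kept, which slice(which.max(hp)) does not do, one possible dplyr sketch is to filter on hp == max(hp) after moving the row names into a column. The column name car and the object top_hp are illustrative choices, not part of the answers above.

```r
library(dplyr)

top_hp <- mtcars %>%
  tibble::rownames_to_column("car") %>%   # keep the car names as a regular column
  filter(cyl == 6, gear == 4) %>%
  filter(hp == max(hp)) %>%               # keeps every row tied for the maximum
  select(car, hp)

top_hp
```

Here this returns both Merc 280 and Merc 280C, each with hp 123, matching the base R answer.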

R - finding within groups

I would like to evaluate conditions within groups. While mtcars may not exactly match my data, here is my problem.
Let's group mtcars by gear. Then I would like to get the subset of data where, within the gear group, there is a row where 'carb' equals 1 and another where it equals 4. I want all the rows of the group if it contains both a 1 and a 4, and I would like to omit all the rows of the group if it doesn't.
p <- arrange(mtcars, gear)
p <- filter(mtcars, carb == 1 & carb == 4)
This gives 0 rows, obviously, since there is not a single row where carb has two values :)
The preferred outcome would be all the rows of mtcars where gear is 3 or 4. Omitting gear = 5 rows since within the group of gear 5, there isn't a carb == 1.
You can do:
library(dplyr)
mtcars %>%
  group_by(gear) %>%
  filter(any(carb == 1) & any(carb == 4))
Or:
mtcars %>%
  group_by(gear) %>%
  filter(all(c(1, 4) %in% carb))
An option with data.table:
library(data.table)
as.data.table(mtcars)[, .SD[sum(c(1, 4) %in% carb) == 2], gear]
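The same group-wise test can also be sketched in base R with ave(), which evaluates a function per group and spreads the result back over the rows. The names has_both and res are illustrative only.

```r
# 1 for every row whose gear group contains both carb == 1 and carb == 4,
# 0 otherwise; comparing with 1 turns that into a logical vector
has_both <- ave(mtcars$carb, mtcars$gear,
                FUN = function(x) all(c(1, 4) %in% x)) == 1

res <- mtcars[has_both, ]
nrow(res)           # 27
unique(res$gear)    # gear 5 is gone; only 3 and 4 remain
```

Unlike the grouped dplyr result, res is a plain data frame that keeps the original row names and row order.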

Dplyr and multiple t test (keeping the same IV)

I'm using this pretty nice code to perform a multiple t.test keeping the independent variable constant!
data(mtcars)
library(dplyr)
vars_to_test <- c("disp","hp","drat","wt","qsec")
iv <- "vs"
mtcars %>%
  summarise_each_(
    funs_(
      sprintf("stats::t.test(.[%s == 0], .[%s == 1])$p.value",iv,iv)
    ),
    vars = vars_to_test)
Unfortunately, dplyr was updated and I now get this message:
summarise_each() is deprecated. Use summarise_all(),
summarise_at() or summarise_if() instead. To map funs over a
selection of variables, use summarise_at()
When I change the code to use _all, _at or _if, it does not work any more. I'm looking for some advice; thanks very much for your support.
Thanks
Instead of building a string expression with sprintf and then evaluating it, we can convert 'vs' to a symbol with rlang::sym and unquote it with !!:
library(dplyr)
mtcars %>%
  summarise_at(vars(vars_to_test), funs(
    try(stats::t.test(.[(!! rlang::sym(iv)) == 0], .[(!! rlang::sym(iv)) == 1])$p.value)
  ))
# disp hp drat wt qsec
#1 2.476526e-06 1.819806e-06 0.01285342 0.0007281397 3.522404e-06
If we really want to parse an expression, we can use rlang::parse_expr and rlang::eval_tidy along with sym:
library(rlang)
eval_tidy(parse_expr("mtcars %>% summarise_at(vars(vars_to_test),
funs(t.test(.[(!!sym(iv))==0],
.[(!!sym(iv))==1])$p.value ))"))
# disp hp drat wt qsec
#1 2.476526e-06 1.819806e-06 0.01285342 0.0007281397 3.522404e-06
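Since funs() and summarise_at() have themselves been deprecated or superseded in later dplyr releases, the same idea can be sketched with across(), assuming dplyr 1.0.0 or newer. Pulling the grouping variable out into a plain vector (grp, a name introduced here only for illustration) avoids the need for !! and sym; this works because the data is not grouped.

```r
library(dplyr)

vars_to_test <- c("disp", "hp", "drat", "wt", "qsec")
iv <- "vs"

# Extract the independent variable once so the lambda can index with it
grp <- mtcars[[iv]]

p_values <- mtcars %>%
  summarise(across(all_of(vars_to_test),
                   ~ t.test(.x[grp == 0], .x[grp == 1])$p.value))

p_values
```

The result is a one-row data frame with one p-value per tested variable, in the same layout as the output above.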
