I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
slice(6)
mpg <- mtcars %>%
filter(mpg > reference$mpg) %>%
count() %>%
pull()
cyl <- mtcars %>%
filter(cyl > reference$cyl) %>%
count() %>%
pull()
tibble(mpg, cyl)
Except, suppose I need to do this for around 100 variables, so there must be a more concise way to repeat the process.
How could I rewrite the code above more concisely (maybe using map() or anything else that works nicely with pipes) so that the result is a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy but I'm stuck.
Thank you!
Or:
library(tidyverse)
map_dfc(mtcars, ~sum(.x[6] < .x))
map2_dfc(mtcars, reference, ~sum(.y < .x))
You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
summarise(across(everything(), ~ sum(. > .[6])))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18 14 15 22 30 11 1 0 13 17 25
base solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 18 14 15 22 30 11 1 0 13 17 25
You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the column names to loop over
loop_var <- colnames(reference)
for(i in loop_var){
nr <- mtcars %>%
filter(mtcars[, i] > reference[, i]) %>%
count() %>%
pull()
# Save every iteration in the loop as the ith element of the list
list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)
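As a cross-check on the loop above, the same counts can be sketched in base R with vapply. This is only an illustration: counts and df_check are names introduced here, not part of the original answer.

```r
# Row 6 of mtcars is the Valiant, used as the reference car
reference <- mtcars[6, ]

# For each column, count the rows whose value exceeds the reference value
counts <- vapply(
  names(mtcars),
  function(col) sum(mtcars[[col]] > reference[[col]]),
  integer(1)
)

# Same long layout (Variable, Value) as df_result in the loop
df_check <- data.frame(Variable = names(counts), Value = unname(counts))
df_check
```

Indexing with [[col]] uses the column name as a string, so no tidy evaluation is needed.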
Thanks for looking at this!
I want a function to build tables showing stats (such as the mean) for specific variables, segregated into groups.
Below is the start of a function that works up to a point. The example uses the built-in mtcars data.
MeansbyGroup<-function(var){
M1<-mtcars %>% group_by(cyl)
n1=deparse(substitute(var))
r1<-transpose(M1 %>% summarise(disp=mean(var)))[2,]
}
# EXAMPLE using mtcars
df=MeansbyGroup(mtcars$disp)
df[nrow(df) + 1,] =MeansbyGroup(mtcars$drat)
df
# The above will output
V1 V2 V3
2 230.721875 230.721875 230.721875
2.1 3.596563 3.596563 3.596563
#which is not even the right means!
#below are the correct values...but I can't automate a table like I want
M1<-mtcars %>% group_by(cyl)
transpose(M1 %>% summarise(disp=mean(disp)))[2,]
transpose(M1 %>% summarise(disp=mean(drat)))[2,]
## Here is my desired output of means disaggregated into columns by the group "cyl"
## if the function worked right with the above example
V1 V2 V3
disp 105.1364 183.3143 353.1
drat 4.070909 3.585714 3.229286
As you will see, in the function I have "n1=deparse(substitute(var))" to capture the variable name which I would like to have in the first column, instead of 2 and 2.1 as shown in the example output.
I've tried a few techniques, but when I try to add n1 to the vector, it destroys the values of the means!
Also, I'd like to make the function more generalizable. For this example, I'd prefer the function call to look like MeansbyGroup(var,group,dataframe), which in the above example would be called by MeansbyGroup(disp,cyl,mtcars).
Thanks!
Here's how I would code your table outside of a function:
library(dplyr)
library(tibble)
mtcars %>%
group_by(cyl) %>%
summarize(across(c(disp, drat), mean)) %>%
column_to_rownames("cyl") %>%
t
# 4 6 8
# disp 105.136364 183.314286 353.100000
# drat 4.070909 3.585714 3.229286
Using across if you might have multiple variables is quite nice. Putting this inside a function, we will need to use deparse(substitute()) because column_to_rownames requires a string argument for the column. But for the others we can use the friendly {{:
foo = function(data, group, vars) {
grp_name = deparse(substitute(group))
data %>%
group_by({{group}}) %>%
summarize(across({{vars}}, mean)) %>%
column_to_rownames(grp_name) %>%
t
}
foo(data = mtcars, group = cyl, vars = c(disp, drat))
# 4 6 8
# disp 105.136364 183.314286 353.100000
# drat 4.070909 3.585714 3.229286
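For comparison, here is a base R sketch of the same table that takes column names as strings, so no deparse(substitute()) is needed. The function name means_by_group and its argument names are hypothetical, chosen only for this example. It also hints at why the function in the question returned the overall mean: mean(mtcars$disp) always sees the whole column, whereas tapply splits the column by group before averaging.

```r
# Hypothetical helper: rows are variables, columns are levels of the grouping column
means_by_group <- function(data, group, vars) {
  # tapply() splits data[[v]] by data[[group]] and averages each piece
  per_var <- sapply(vars, function(v) tapply(data[[v]], data[[group]], mean))
  # sapply() returns groups in rows and variables in columns, so transpose
  t(per_var)
}

res <- means_by_group(mtcars, group = "cyl", vars = c("disp", "drat"))
res
```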
I want to create a data frame with rows that repeat.
Here is my original dataset:
> mtcars_columns_a
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 disp mtcars mtcars$disp 230.72188
3 hp mtcars mtcars$hp 146.68750
Here is my desired dataset:
> mtcars_columns_b
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 mpg mtcars mtcars$mpg 20.09062
3 disp mtcars mtcars$disp 230.72188
4 disp mtcars mtcars$disp 230.72188
5 hp mtcars mtcars$hp 146.68750
6 hp mtcars mtcars$hp 146.68750
I know how to do this the long way manually, but this is time consuming and rigid. Is there a quicker way to do this that is more automated and flexible?
Here is the code I used to create the dataset:
# mtcars data
## displays data
mtcars
## 3 row data set
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_a <-
data.frame(
c(
"mpg",
"disp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_a)[names(mtcars_columns_a) == 'c..mpg....disp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_a$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_a$data_set_and_variables_interest <-
paste(mtcars_columns_a$data_set,mtcars_columns_a$variables_interest,sep = "$")
### creates mean column
mtcars_columns_a$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$hp)
)
## 6 row data set, the long way
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_b <-
data.frame(
c(
"mpg",
"mpg",
"disp",
"disp",
"hp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_b)[names(mtcars_columns_b) == 'c..mpg....mpg....disp....disp....hp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_b$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_b$data_set_and_variables_interest <-
paste(mtcars_columns_b$data_set,mtcars_columns_b$variables_interest,sep = "$")
### creates mean column
mtcars_columns_b$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$disp),
mean(mtcars$hp),
mean(mtcars$hp)
)
You can try rep like below
mtcars_columns_a[rep(seq(nrow(mtcars_columns_a)), each = 2),]
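A self-contained sketch of this approach, assuming the data frame is built with named columns up front (which avoids the names() fix-up in the question). The names vars and n_copies are introduced here for illustration only.

```r
# Variables of interest: the only manual input
vars <- c("mpg", "disp", "hp")

# Build the 3-row data set with named columns directly
mtcars_columns_a <- data.frame(
  variables_interest = vars,
  data_set = "mtcars",
  data_set_and_variables_interest = paste("mtcars", vars, sep = "$"),
  mean = vapply(vars, function(v) mean(mtcars[[v]]), numeric(1), USE.NAMES = FALSE)
)

# Repeat every row n_copies times, keeping the copies next to each other
n_copies <- 2
mtcars_columns_b <- mtcars_columns_a[rep(seq_len(nrow(mtcars_columns_a)), each = n_copies), ]

# Row indexing leaves row names like "1", "1.1"; reset them to 1..n
rownames(mtcars_columns_b) <- NULL
mtcars_columns_b
```

Changing n_copies, or adding names to vars, is then enough to regenerate the repeated table.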
Another option is uncount
library(dplyr)
library(tidyr)
mtcars_columns_a %>%
uncount(2)
Based on your expected output is this the sort of thing you were after?
The selection of required variables is made with the select function and the mean calculated using the summarise function following group_by variables.
The duplication of data and adding of additional variables (not really sure if these are necessary) is carried out using mutate.
You can edit variable names using the dplyr::rename function.
library(dplyr)
library(tidyr)
df <-
mtcars %>%
select(mpg, disp, hp) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean = mean(value))
df1 <-
bind_rows(df, df) %>%
arrange(name) %>%
mutate(dataset = "mtcars",
variable = paste(dataset, name, sep = "$"))
df1
#> # A tibble: 6 x 4
#> name mean dataset variable
#> <chr> <dbl> <chr> <chr>
#> 1 disp 231. mtcars mtcars$disp
#> 2 disp 231. mtcars mtcars$disp
#> 3 hp 147. mtcars mtcars$hp
#> 4 hp 147. mtcars mtcars$hp
#> 5 mpg 20.1 mtcars mtcars$mpg
#> 6 mpg 20.1 mtcars mtcars$mpg
Created on 2021-04-06 by the reprex package (v1.0.0)
The order of records in a data.frame object is usually not meaningful, so you could just do:
rbind(mtcars_columns_a, mtcars_columns_a)
If you need it to be in the order you showed, this is also simple:
mtcars_columns_b <- rbind(mtcars_columns_a, mtcars_columns_a)
mtcars_columns_b[order(match(mtcars_columns_b$variables_interest, mtcars_columns_a$variables_interest)), ]
This is my first attempt at working with lists. I am trying to clean up my code by moving parts of it into functions.
One idea is to split a big dataframe into multiple subsets with a function, so that I can get the subsets by calling the function when needed.
With the mtcars dataframe I would like to explain what I am trying to do:
add an id to mtcars
create a function with one argument (mtcars) that outputs a list of subset dataframes (mtcars1, mtcars2, mtcars3) -> learned here: How to assign from a function which returns more than one value? answer by Federico Giorgi
So far I have managed to create the list. But I don't know how to get the 3 subset dataframes (mtcars1, mtcars2, mtcars3) as objects in the global environment. How can I access these 3 dataframe objects from the list returned by my function? Thanks!
My Code:
library(dplyr)
# add id to mtcars
mtcars <- mtcars %>%
mutate(id = row_number())
# create function to subset in 3 dataframes
my_func_cars <- function(input){
# first subset
mtcars1 <- mtcars %>%
select(id, mpg, cyl, disp)
# second subset
mtcars2 <- mtcars %>%
select(id, hp, drat, wt, qsec)
# third subset
mtcars3 <- mtcars %>%
select(id, vs, am, gear, carb)
output <- list(mtcars1, mtcars2, mtcars3)
return(output)
}
output<-my_func_cars(mtcars)
for (i in output) {
print(i)
}
It may be better to output a named list
library(dplyr)
library(stringr)
my_func_cars <- function(input){
nm1 <- deparse(substitute(input))
# first subset
obj1 <- input %>%
select(id, mpg, cyl, disp)
# second subset
obj2 <- input %>%
select(id, hp, drat, wt, qsec)
# third subset
obj3 <- input %>%
select(id, vs, am, gear, carb)
dplyr::lst(!! str_c(nm1, 1) := obj1,
!! str_c(nm1, 2) := obj2,
!! str_c(nm1, 3) := obj3)
}
and then we use list2env to create objects in the global env
mtcars <- mtcars %>%
mutate(id = row_number())
list2env(my_func_cars(mtcars), .GlobalEnv)
-check the objects
head(mtcars1, 2)
# id mpg cyl disp
#1 1 21 6 160
#2 2 21 6 160
head(mtcars2, 2)
# id hp drat wt qsec
#1 1 110 3.9 2.620 16.46
#2 2 110 3.9 2.875 17.02
head(mtcars3, 2)
# id vs am gear carb
#1 1 0 1 4 4
#2 2 0 1 4 4
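If the subsets do not have to become separate objects in the global environment, a named list can also be used directly. The sketch below uses base R only; split_cols, cars, and the element names part1 to part3 are hypothetical names chosen for this example.

```r
# Work on a copy so the built-in mtcars is left untouched
cars <- mtcars
cars$id <- seq_len(nrow(cars))

# Hypothetical helper: returns the three column subsets as a named list
split_cols <- function(input) {
  list(
    part1 = input[, c("id", "mpg", "cyl", "disp")],
    part2 = input[, c("id", "hp", "drat", "wt", "qsec")],
    part3 = input[, c("id", "vs", "am", "gear", "carb")]
  )
}

parts <- split_cols(cars)

# Access a subset by name with $ or [[ ]]
head(parts$part1, 2)
head(parts[["part3"]], 2)
```

Keeping the subsets in one list makes it easy to loop over them with lapply, at the cost of typing the list name each time.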
I would like to evaluate conditions within groups. While mtcars may not exactly match my data, here is my problem.
Let's group mtcars by gear. Then I would like to get the subset of the data where, within the gear group, there is a row where 'carb' equals 1 and a row where it equals 4. I want all the rows of a group if it contains both a 1 and a 4, and I would like to omit all the rows of the group if it doesn't.
p <- arrange(mtcars, gear)
p <- filter(mtcars, carb == 1 & carb == 4)
This gives 0 rows, obviously, since there is not a single row where carb has two values :)
The preferred outcome would be all the rows of mtcars where gear is 3 or 4. Omitting gear = 5 rows since within the group of gear 5, there isn't a carb == 1.
You can do:
mtcars %>%
group_by(gear) %>%
filter(any(carb == 1) & any(carb == 4))
Or:
mtcars %>%
group_by(gear) %>%
filter(all(c(1, 4) %in% carb))
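The same group-wise condition can be sketched in base R with ave, which evaluates a function per group and returns a vector as long as the input. The names has_both and result are introduced here for illustration.

```r
# For every row, flag whether its gear group contains both carb == 1 and carb == 4.
# ave() recycles the single TRUE/FALSE per group across that group's rows
# (stored as 1/0 because carb is numeric).
has_both <- ave(mtcars$carb, mtcars$gear,
                FUN = function(x) all(c(1, 4) %in% x)) == 1

# Keep only the rows belonging to qualifying groups
result <- mtcars[has_both, ]
table(result$gear)
```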
An option with data.table
library(data.table)
as.data.table(mtcars)[, .SD[sum(c(1, 4) %in% carb) == 2], by = gear]
I'd like to apply a transformation to all columns via dplyr::mutate_each, e.g.
library(dplyr)
mult <- function(x,m) return(x*m)
mtcars %>% mutate_each(funs(mult(.,2))) # Multiply all columns by a factor of two
However, the transformation should have parameters depending on the column name. Therefore, the column name should be passed to the function as an additional argument
named.mult <- function(x,colname) return(x*param.A[[colname]])
Example: multiply every column by a different factor:
param.A <- c()
param.A[names(mtcars)] <- seq(length(names(mtcars)))
param.A
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 2 3 4 5 6 7 8 9 10 11
Since the column name gets lost during mutate_each, I currently work around this by passing a list with lazy evaluation to mutate_ (the SE version):
library(lazyeval)
named.mutate <- function(fun, cols) sapply(cols, function(n) interp(~fun(col, n), fun=fun, col=as.name(n)))
mtcars %>% mutate_(.dots=named.mutate(named.mult, names(.)))
Works, but is there some special variable like .name which contains the column name of . for each colwise execution? So I could do something like
mtcars %>% mutate_each(funs(named.mult(.,.name)))
I'd suggest taking a different approach. Instead of using mutate_each a combination of dplyr::mutate with tidyr::gather and tidyr::spread can achieve the same result.
For example:
library(dplyr)
library(tidyr)
data(mtcars)
# Multiply each column by a different integer
mtcars %>%
dplyr::tbl_df() %>%
dplyr::mutate(make_and_model = rownames(mtcars)) %>%
tidyr::gather(key, value, -make_and_model) %>%
dplyr::mutate(m = as.integer(factor(key)), # a multiplication factor dependent on column name
value = value * m) %>%
dplyr::select(-m) %>%
tidyr::spread(key, value)
# compare to the original data
mtcars[order(rownames(mtcars)), order(names(mtcars))]
# the multiplicative values used.
mtcars %>%
tidyr::gather() %>%
dplyr::mutate(m = as.integer(factor(key))) %>%
dplyr::select(-value) %>%
dplyr::distinct()
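In later dplyr versions (assumed here: 1.0.0 or newer), across() together with cur_column() gives access to the current column name, which is close to the ".name" variable the question asks about. The sketch below is an illustration under that assumption, not part of the original answers; scaled is a name introduced here.

```r
library(dplyr)

# One multiplier per column, named by column (same values as param.A in the question)
param.A <- setNames(seq_along(mtcars), names(mtcars))

# cur_column() returns the name of the column currently being processed by across()
scaled <- mtcars %>%
  mutate(across(everything(), ~ .x * param.A[[cur_column()]]))

head(scaled, 2)
```

Unlike the gather/spread approach, this keeps the original row and column order.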