I am cleaning some data and like to use the count() function in dplyr to look at unique values of every variable.
Is there a way to do this automatically? Right now I am using this method:
df %>% count(variable1)
df %>% count(variable2)
df %>% count(variable3)
...
I would like something that returns all of them without me having to repeat the line of code and type in each variable. I thought about trying to have R recognize all the column names and automatically fill them in but I'm not sure where to start. If I just add variables together, say
df %>% count(variable1, variable2)
I get counts by both of those variables when I want individual tables for each variable.
Suppose you want to count am, gear, and carb from mtcars. You can apply table() to each variable with map(), which returns a list.
library(dplyr)
library(purrr)
mtcars %>%
select(am, gear, carb) %>%
map(table)
# $am
# 0 1
# 19 13
#
# $gear
# 3 4 5
# 15 12 5
#
# $carb
# 1 2 3 4 6 8
# 7 10 3 10 1 1
Base R version:
lapply(mtcars[c("am", "gear", "carb")], table)
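If you would rather keep the count() output itself (one data frame per variable) instead of table objects, a minimal sketch along the same lines follows. The names vars and counts are illustrative only and are not part of the answer above.

```r
library(dplyr)
library(purrr)

# Columns to tabulate (illustrative choice)
vars <- c("am", "gear", "carb")

# One count() result per column, returned as a named list
counts <- vars %>%
  set_names() %>%
  map(function(v) count(mtcars, across(all_of(v))))

counts$gear
```

To cover every column, names(mtcars) could be used in place of vars.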
In addition, you can use summary(), which counts factor variables.
mtcars %>%
select(am, gear, carb) %>%
mutate(across(everything(), as.factor)) %>%
summary()
# am gear carb
# 0:19 3:15 1: 7
# 1:13 4:12 2:10
# 5: 5 3: 3
# 4:10
# 6: 1
# 8: 1
It looks like you can use a tidyverse approach to solve your issue. You want to get the counts for each variable in your dataset (next time, please add a sample of df). You can get close to what you want by reshaping the data to long format. I will show an example with mtcars, choosing variables that take only a few distinct values so that counts are a sensible summary. Here is the code:
library(tidyverse)
#Data
data("mtcars")
I will select some categorical variables with the code below, then reshape to long format. Finally, I will use group_by() with summarise() and n() (which counts rows) to get the counts:
#Code
mtcars %>% select(cyl,vs,am,gear,carb) %>%
#Format to long
pivot_longer(cols = everything()) %>%
#Group and summarise
group_by(name,value) %>%
summarise(N=n())
Output:
# A tibble: 16 x 3
# Groups: name [5]
name value N
<chr> <dbl> <int>
1 am 0 19
2 am 1 13
3 carb 1 7
4 carb 2 10
5 carb 3 3
6 carb 4 10
7 carb 6 1
8 carb 8 1
9 cyl 4 11
10 cyl 6 7
11 cyl 8 14
12 gear 3 15
13 gear 4 12
14 gear 5 5
15 vs 0 18
16 vs 1 14
As you can see, all the variables are shown with their respective values and counts.
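As a possible shorthand for the group_by() / summarise() / n() steps, count() can take both grouping columns directly. This is only a sketch of the same idea; the object name long_counts is illustrative.

```r
library(dplyr)
library(tidyr)

long_counts <- mtcars %>%
  select(cyl, vs, am, gear, carb) %>%
  # Reshape to long: one row per (variable name, value) pair
  pivot_longer(cols = everything()) %>%
  # Count rows per pair; name = "N" sets the name of the count column
  count(name, value, name = "N")

long_counts
```

Unlike the summarise() version, the result here is not grouped, so no ungroup() is needed before further processing.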
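A simple solution would be to use sapply() or lapply() with table():
sapply(df, table)
This returns a count table for each column of df. Note that sapply() may simplify the result to a matrix when every column has the same number of distinct values, whereas lapply() always returns a list. You can always pass in a subsetted data frame to get the counts for just your variables of interest.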
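To make the difference between the two concrete, here is a small sketch on two mtcars columns that both take exactly two values (the object names are illustrative):

```r
# lapply() always gives a list with one table per column
res_list <- lapply(mtcars[c("vs", "am")], table)

# sapply() simplifies equal-length results, so here it gives a 2 x 2 matrix
res_mat <- sapply(mtcars[c("vs", "am")], table)

res_list$vs
res_mat
```

If downstream code expects a list regardless of the data, lapply() is the safer choice.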
I need to count the number of occurrences of specific values in each column, and I am trying to use a for loop to run count() across the entire dataframe (which has several thousand columns).
For instance, if I have a column consisting of [0,0,0,1,1,0,0,0,0,0,0,0], I want it to count the column and return:
1 -> 2 counts
0 -> 10 counts
The dataframe consists entirely of 0s and 1s. I just need to count how many of each are in every column, but the dataframe has a few thousand columns.
Currently, my for loop doesn't work: it seems to only register the first column and keeps printing that same first-column result over and over. Thanks everyone!
s <- 0
yes_filt_high_mutation <- data.frame();
for(c in colnames(high_mutations)[2:ncol(high_mutations)]){ #high_mutations = my dataframe
mutation_results = high_mutations %>% count(high_mutations$c); #Count the # of 0s and 1s in each column
print(c)
print(mutation_results)
s <- s + 1
add_column <- c(c,mutation_results[1,2],mutation_results[2,2])
yes_filt_high_mutation <- rbind(data.frame(yes_filt_high_mutation), add_column)
}
names(yes_filt_high_mutation)[1] <- "Samples"
names(yes_filt_high_mutation)[2] <- "Number of 0's"
names(yes_filt_high_mutation)[3] <- "Number of 1's"
I want my result to be something like this, for each loop result:
So essentially, it should tell me that there are 134 counts of 0 and 2 counts of 1 in Column 1.
high_mutations$Column1 n
1 0 134
2 1 2
I would suggest that you reflect on the desired final format. If your intention is to get a count of observations within each column, you can obtain that using common verbs available in the tidyverse.
library(tidyverse)
select(mtcars, cyl, vs, gear) %>%
pivot_longer(cols = everything()) %>%
group_by(name, value) %>%
summarise(ndist = n())
#> `summarise()` has grouped output by 'name'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 3
#> # Groups: name [3]
#> name value ndist
#> <chr> <dbl> <int>
#> 1 cyl 4 11
#> 2 cyl 6 7
#> 3 cyl 8 14
#> 4 gear 3 15
#> 5 gear 4 12
#> 6 gear 5 5
#> 7 vs 0 18
#> 8 vs 1 14
Created on 2022-04-16 by the reprex package (v2.0.1)
Explanation
For the sake of simplicity, the set of columns is reduced to vs, cyl and gear via the select verb.
The data is transformed to long format via pivot_longer (from tidyr) to make grouping easier.
The key element is counting occurrences of each name/value combination which, if I understood your request, is your goal. So for column cyl we get 11 instances of value 4, 7 instances of value 6, and so on.
Optional
You can transform that data into a wide format using pivot_wider, but I wouldn't rush that, as nicely formatted long data is frequently easier to work with.
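For reference, the wide form could be sketched roughly as follows, using count() as a shorthand for the group_by() / summarise() step. The object names long and wide are illustrative.

```r
library(dplyr)
library(tidyr)

long <- mtcars %>%
  select(cyl, vs, gear) %>%
  pivot_longer(cols = everything()) %>%
  count(name, value)

# One row per distinct value, one column per original variable;
# combinations that never occur are filled with NA
wide <- long %>%
  pivot_wider(names_from = name, values_from = n)

wide
```

The NA cells in the wide table are one reason the long form tends to be easier to work with.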
Wider remarks
Looping over columns in a data frame is generally not advisable practice. R offers a number of optimised, robust and mature approaches to achieve similar objectives. The apply family of functions in base R, or the across verb offered by the tidyverse, are good starting points.
You may wish to reflect on refining your requirements. As was observed in the comments, are you in effect looking for an output similar to table(mtcars$cyl) plus some additional embellishments?
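For the specific case in the question (a data frame of only 0s and 1s), a loop-free sketch in base R could look like the one below. The small high_mutations data frame is a made-up stand-in, and the column names n_zero and n_one are illustrative. As a side note, inside the original loop high_mutations$c looks for a column literally named c; high_mutations[[c]] would use the value of the loop variable instead.

```r
# Hypothetical stand-in for high_mutations: an ID column followed by 0/1 columns
high_mutations <- data.frame(
  gene     = c("g1", "g2", "g3", "g4"),
  sample_a = c(0, 0, 1, 0),
  sample_b = c(1, 1, 0, 0)
)

# Drop the first column, as the original loop does with 2:ncol(...)
values <- high_mutations[-1]

# colSums() over a logical matrix counts the TRUEs in each column
yes_filt_high_mutation <- data.frame(
  Samples   = names(values),
  n_zero    = colSums(values == 0),
  n_one     = colSums(values == 1),
  row.names = NULL
)

yes_filt_high_mutation
```

Because colSums() works on the whole matrix at once, this should scale to several thousand columns without an explicit loop.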
Alternative solution
If you are not too fussed about the output format you could also leverage map.
library(tidyverse)
select(mtcars, cyl, vs, gear) %>%
map(~ table(.x))
#> $cyl
#> .x
#> 4 6 8
#> 11 7 14
#>
#> $vs
#> .x
#> 0 1
#> 18 14
#>
#> $gear
#> .x
#> 3 4 5
#> 15 12 5
You will arrive at an identical result, but as a list. You may wish to pack those into a data frame, but if you intend to do that, staying with group_by is probably more straightforward.
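If you do want to pack the list of tables into a single data frame, one possible sketch is to convert each table with as.data.frame() and row-bind the pieces. The names tabs and packed are illustrative.

```r
library(dplyr)
library(purrr)

# One small data frame (value, Freq) per column
tabs <- mtcars %>%
  select(cyl, vs, gear) %>%
  map(~ as.data.frame(table(value = .x), stringsAsFactors = FALSE))

# Stack them, keeping the list names in a "name" column
packed <- bind_rows(tabs, .id = "name")

packed
```

Note that value comes back as character here, because table() stores the distinct values as labels.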
I have a dataframe with data from a survey. I would like to produce a report in table format with the frequencies of each variable.
So working with the dataset mtcars, having this:
> count(mtcars, cyl)
cyl n
1 4 11
2 6 7
3 8 14
> count(mtcars, gear)
gear n
1 3 15
2 4 12
3 5 5
I would like to produce a table like this (or something similar):
variable   n
cyl
4          11
6          7
8          14
gear
3          15
4          12
5          5
Any idea as to how this may be achievable?
We can write a nested pair of functions to map count to multiple variables and row-bind the results, using a little tidy evaluation:
library(dplyr)
library(purrr)
count_multi <- function(.data, ...) {
count_var <- function(var, .data) {
.data %>%
count(Value = factor({{ var }})) %>% # coerce to factor to allow multiple
mutate( # var types and preserve ordering
Variable = as.character(ensym(var)),
.before = everything()
)
}
map_dfr(enquos(...), count_var, .data = .data)
}
mtcars2 <- mtcars %>%
mutate(
vs = factor(vs, labels = c("V", "S")),
am = factor(am, labels = c("manual", "automatic"))
)
mtcars2 %>%
count_multi(vs, am, cyl)
Output:
Variable Value n
1 vs V 18
2 vs S 14
3 am manual 19
4 am automatic 13
5 cyl 4 11
6 cyl 6 7
7 cyl 8 14
I believe you can use kableExtra::pack_rows() to create subheaders for each Variable in markdown.
The approach below gets us the output in a slightly different format. However, it does allow for subsetting by the variable column, which the layout in OP's request does not.
library(data.table)
df <- setDT(copy(mtcars))
# select columns as grouping by continuous variables is not appropriate
x <- c('cyl', 'gear')
y <- lapply(x, \(i) df[, .N, i])
names(y) <- x
y <- rbindlist(y, idcol=T, use.names=F)
names(y) <- c('variable', 'class', 'count')
variable class count
1: cyl 6 7
2: cyl 4 11
3: cyl 8 14
4: gear 4 12
5: gear 3 15
6: gear 5 5
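A similar long table can also be sketched with melt() instead of lapply() plus rbindlist(). This is an alternative to the answer's code rather than a restatement of it, and the object names dt and long are illustrative.

```r
library(data.table)

dt <- as.data.table(mtcars)
x <- c("cyl", "gear")

# Stack the selected columns into (variable, class) pairs
long <- melt(dt, measure.vars = x,
             variable.name = "variable", value.name = "class")

# Count rows per pair
y <- long[, .(count = .N), by = .(variable, class)]

# Subsetting on the variable column is then a one-liner
y[variable == "cyl"]
```

The columns in x should share a type, since melt() stacks them into a single class column.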
I want to pass multiple columns to one UDF argument in the tidy way (so as bare column names).
Example: I have a simple function which takes a column of the mtcars dataset as an input and uses that as the grouping variable to do an easy count operation with summarise.
library(tidyverse)
test_function <- function(grps){
grps <- enquo(grps)
mtcars %>%
group_by(!!grps) %>%
summarise(Count = n())
}
Result if I execute the function with "cyl" as the grouping variable:
test_function(grps = cyl)
-----------------
cyl Count
<dbl> <int>
1 4 11
2 6 7
3 8 14
Now imagine I want to pass multiple columns to the argument "grps" so that the dataset is grouped by more columns. Here is what I imagine some example function executions could look like:
test_function(grps = c(cyl, gear))
test_function(grps = list(cyl, gear))
Here is what the expected result would look like:
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Is there a way to pass multiple bare columns to one argument of a UDF? I know about the "..." operator already but since I have in reality 2 arguments where I want to possibly pass more than one bare column as an argument the "..." is not feasible.
You can use the across() function with embraced arguments for this, which works for most dplyr verbs. It will accept bare names or character strings:
test_function <- function(grps){
mtcars %>%
group_by(across({{ grps }})) %>%
summarise(Count = n())
}
test_function(grps = c(cyl, gear))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 8 x 3
# Groups: cyl [3]
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
test_function(grps = c("cyl", "gear"))
# Same output
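Since the question mentions having two arguments that may each take several bare columns, here is a sketch of how the same pattern might extend to that case. The function name summarise_by, its vars argument, and the choice of mean() are hypothetical and only for illustration.

```r
library(dplyr)

# grps and vars can each be a single bare column or a c(...) of several
summarise_by <- function(data, grps, vars) {
  data %>%
    group_by(across({{ grps }})) %>%
    summarise(across({{ vars }}, mean), .groups = "drop")
}

res <- summarise_by(mtcars, grps = c(cyl, gear), vars = c(mpg, hp))
res
```

Each embraced argument is resolved independently, so there is no need for ... here.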
Consider code and output like this:
mtcars %>% group_by(gear) %>% summarize(cyls=paste(unique(cyl),collapse=','))
# A tibble: 3 x 2
gear cyls
<dbl> <chr>
1 3 6,8,4
2 4 6,4
3 5 4,8,6
For each gear, I am given a "vector" of unique cyls.
A user wants to keep results of an operation inside a data.frame (e.g., which age categories for each event are important) but as a vector (for input into another function).
How can the code be rewritten to output not a character string but a vector of cyls? The code below fails.
mtcars %>% group_by(gear) %>% summarize(cyls=unique(cyl))
The output should be a vector-type column, something like
gear cyls
3 c(6,8,4)
4 c(6,4)
5 c(4,8,6)
Make cyls a list column like this:
my_df <- mtcars %>% group_by(gear) %>% summarize(cyls=list(unique(cyl)))
my_df$cyls[[1]] # boom dbl vectors for each row stored as a list
[1] 6 8 4
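Once the values sit in a list column, each element can be passed to another function row by row. A small sketch, where the derived columns n_cyl and max_cyl are illustrative:

```r
library(dplyr)
library(purrr)

my_df <- mtcars %>%
  group_by(gear) %>%
  summarise(cyls = list(unique(cyl)))

# lengths() gives the size of each vector; map_dbl() applies a function to each
summary_df <- my_df %>%
  mutate(
    n_cyl   = lengths(cyls),
    max_cyl = map_dbl(cyls, max)
  )

summary_df
```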
Using base R we can have:
(dat=aggregate(cyl~gear,mtcars,unique))
gear cyl
1 3 6, 8, 4
2 4 6, 4
3 5 4, 8, 6
Where
dat$cyl
$`1`
[1] 6 8 4
$`2`
[1] 6 4
$`3`
[1] 4 8 6
Using dplyr, you will have to coerce the final tibble into a data frame to be able to obtain the same results.
I would like to make tables for publication that give the number of observations, grouped by two variables. The code for this works fine. However, I have run into problems when trying to turn this into a function.
I am using dplyr_0.7.2
Example using mtcars:
Code for table outside of function: this works
library(tidyverse)
tab1 <- mtcars %>% count(cyl) %>% rename(Total = n)
tab2 <- mtcars %>%
group_by(cyl, gear) %>% count %>%
spread(gear, n)
tab <- full_join(tab1, tab2, by = "cyl")
tab
# This is the output (which is what I want)
# A tibble: 3 x 5
cyl Total `3` `4` `5`
<dbl> <int> <int> <int> <int>
1 4 11 1 8 2
2 6 7 2 4 1
3 8 14 12 NA 2
Trying to put this into a function
Function for tab1: this works
count_by_two_groups_A <- function(df, var1){
var1 <- enquo(var1)
tab1 <- df %>% count(!!var1) %>% rename(Total = n)
tab1
}
count_by_two_groups_A(mtcars, cyl)
# A tibble: 3 x 2
cyl Total
<dbl> <int>
1 4 11
2 6 7
3 8 14
Function for first part of tab2: it works up to this point, but...
count_by_two_groups_B <- function(df, var1, var2){
var1 <- enquo(var1)
var2 <- enquo(var2)
tab2 <- df %>% group_by((!!var1), (!!var2)) %>% count
tab2
}
count_by_two_groups_B(mtcars, cyl, gear)
# A tibble: 8 x 3
# Groups: (cyl), (gear) [8]
`(cyl)` `(gear)` n
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
The column names have changed to (cyl) and (gear). I can't seem to figure out how to carry on with spread() and full_join() (or anything else using the new column names) now that the column names have changed. I.e. I can't figure out how to specify the new column names in the tidyeval way, to be able to carry on. I have tried various things, without success.
The usual way of setting names in a tidyeval context is to use the definition operator :=. It would look like this:
df %>%
group_by(
!! nm1 := !! var1,
!! nm2 := !! var2
) %>%
count()
For this you need to extract nm1 from var1. Unfortunately I don't have an easy way of stripping the enclosing parentheses yet. I think it would make sense to do it in the forthcoming function ensym() (it captures symbols instead of quosures and issues an error if you supply a call). I have submitted a ticket here: https://github.com/tidyverse/rlang/issues/223
Fortunately we have two easy solutions here. First note that you don't need the enclosing parentheses. They are only needed when other operators are involved in the captured expression. E.g. in these situations:
(!! var) / avg
(!! var) < value
In these cases, if you omitted the parentheses, !! would try to unquote the whole expression instead of just the one symbol. In your function, on the other hand, there is no operator, so you can safely unquote without enclosing parentheses:
count_by_two_groups_B <- function(df, var1, var2) {
var1 <- enquo(var1)
var2 <- enquo(var2)
df %>%
group_by(!! var1, !! var2) %>%
count()
}
Finally, you could make your function more general by allowing a variable number of arguments. This is even easier to implement because dots are forwarded so there is no need to capture and unquote. Just pass them down to group_by():
count_by <- function(df, ...) {
df %>%
group_by(...) %>%
count()
}
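To carry on to the full table from the question, a sketch using the embrace operator and pivot_wider() is below. It assumes a recent dplyr, tidyr and rlang (considerably newer than the dplyr 0.7.2 mentioned in the question), and the function name count_by_two_groups is illustrative.

```r
library(dplyr)
library(tidyr)

count_by_two_groups <- function(df, var1, var2) {
  # Totals per level of var1
  totals <- df %>% count({{ var1 }}, name = "Total")

  # Counts per (var1, var2), spread so each level of var2 becomes a column
  wide <- df %>%
    count({{ var1 }}, {{ var2 }}) %>%
    pivot_wider(names_from = {{ var2 }}, values_from = n)

  # Join on var1; as_name() turns the captured symbol into a string
  full_join(totals, wide, by = rlang::as_name(rlang::enquo(var1)))
}

res <- count_by_two_groups(mtcars, cyl, gear)
res
```

Because the columns keep their original names, no renaming with := is needed in this version.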
I can make it work with standard evaluation, passing the column names as strings. I did not use the tidyverse package as I did not have it installed and did not bother installing it; dplyr and tidyr are enough.
Here is working code:
library(dplyr)
library(tidyr)
count_by_two_groups_B <- function(df, var1, var2){
# var1 <- enquo(var1)
# var2 <- enquo(var2)
# spread_() takes the key column as a string, so the function is not tied to gear
tab2 <- df %>% group_by_(var1, var2) %>% summarise(n = n()) %>% spread_(var2, "n")
tab2
}
count_by_two_groups_B(mtcars, 'cyl', 'gear')
Result:
# A tibble: 3 x 4
# Groups: cyl [3]
cyl `3` `4` `5`
* <dbl> <int> <int> <int>
1 4 1 8 2
2 6 2 4 1
3 8 12 NA 2
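The underscore verbs such as group_by_() have since been deprecated in dplyr. A sketch of the same string-based interface with current verbs might look like this (the function name count_by_strings is illustrative):

```r
library(dplyr)
library(tidyr)

# var1 and var2 are column names given as strings
count_by_strings <- function(df, var1, var2) {
  df %>%
    count(across(all_of(c(var1, var2)))) %>%
    pivot_wider(names_from = all_of(var2), values_from = n)
}

res <- count_by_strings(mtcars, "cyl", "gear")
res
```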
This is one of those situations where reaching for dplyr or the tidyverse seems excessive. There are base functions to do this: table and, to put the results in long form, as.data.frame:
as.data.frame( with(mtcars, table(cyl,gear)) , responseName="Total")
#--------
cyl gear Total
1 4 3 1
2 6 3 2
3 8 3 12
4 4 4 8
5 6 4 4
6 8 4 0
7 4 5 2
8 6 5 1
9 8 5 2
This would be one dplyr approach:
mtcars %>% group_by(cyl,gear) %>% summarise(Total=n())
#----
# A tibble: 8 x 3
# Groups: cyl [?]
cyl gear Total
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
And if the question was how to get this as a table object (thinking that might have been your goal with spread), then just:
with(mtcars, table(cyl,gear))
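If the per-cyl total from the question is wanted as well, base R's addmargins() can append it to the table. A minimal sketch follows; note that the extra column is labelled Sum rather than Total.

```r
tab <- with(mtcars, table(cyl, gear))

# margin = 2 appends a column holding the row sums
tab_total <- addmargins(tab, margin = 2)

tab_total
```

Unlike the spread() result, combinations that never occur appear as 0 here rather than NA.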