Apply function table() to each column of a data.frame using dplyr
I often apply the table-function on each column of a data frame using plyr, like this:
library(plyr)
ldply( mtcars, function(x) data.frame( table(x), prop.table( table(x) ) ) )
Is it possible to do this in dplyr also?
My attempts fail:
mtcars %>% do( table %>% data.frame() )
melt( mtcars ) %>% do( table %>% data.frame() )
You can try the following which does not rely on the tidyr package.
mtcars %>%
lapply(table) %>%
lapply(as.data.frame) %>%
Map(cbind,var = names(mtcars),.) %>%
bind_rows() %>% # rbind_all() was removed from dplyr; bind_rows() replaces it
group_by(var) %>%
mutate(pct = Freq / sum(Freq))
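For comparison, here is a minimal sketch of the same long frequency table built by reshaping first and counting afterwards. Unlike the answer above, it does rely on tidyr (pivot_longer() needs tidyr 1.0 or later), and it assumes all columns share a type, which holds for mtcars because every column is numeric. The names freq, var, value, Freq and pct are just illustrative choices.

```r
library(dplyr)
library(tidyr)

# Stack every column into (var, value) pairs, then count each pair.
freq <- mtcars %>%
  pivot_longer(everything(), names_to = "var", values_to = "value") %>%
  count(var, value, name = "Freq") %>%
  group_by(var) %>%
  mutate(pct = Freq / sum(Freq)) %>%  # share within each original column
  ungroup()

freq
```

With columns of mixed types you would probably need to convert them to character before reshaping.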
Using tidyverse (dplyr and purrr):
library(tidyverse)
mtcars %>%
map( function(x) table(x) )
Or:
mtcars %>%
map(~ table(.x) )
Or simply:
library(tidyverse)
mtcars %>%
map( table )
In general you probably would not want to run table() on every column of a data frame because at least one of the variables will be unique (an id field) and produce a very long output. However, you can use group_by() and tally() to obtain frequency tables in a dplyr chain. Or you can use count() which does the group_by() for you.
> mtcars %>%
group_by(cyl) %>%
tally()
> # mtcars %>% count(cyl)
Source: local data frame [3 x 2]
cyl n
1 4 11
2 6 7
3 8 14
If you want to do a two-way frequency table, group by more than one variable.
> mtcars %>%
group_by(gear, cyl) %>%
tally()
> # mtcars %>% count(gear, cyl)
You can use spread() of the tidyr package to turn that two-way output into the output one is used to receiving with table() when two variables are input.
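A small sketch of that reshaping step, using pivot_wider(), which tidyr now recommends over spread() (this assumes tidyr 1.0 or later; the name two_way is only for illustration):

```r
library(dplyr)
library(tidyr)

# Count gear/cyl combinations, then move cyl into the columns.
two_way <- mtcars %>%
  count(gear, cyl) %>%
  pivot_wider(names_from = cyl, values_from = n, values_fill = 0L)

two_way
```

The values_fill argument fills combinations that never occur (here gear 4 with cyl 8) with zero instead of NA, which matches what table(mtcars$gear, mtcars$cyl) reports.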
Caner's solution did not work for me, but the one from commenter akrun (credit goes to him) worked great. I demo it here on a much larger tibble, and I also sort by count (and therefore by percent) in descending order.
library(tidyverse)
library(nycflights13); dim(flights)
tte <- gather(flights, Var, Val) %>%
group_by(Var) %>% dplyr::mutate(n = n()) %>%
group_by(Var, Val) %>% dplyr::mutate(n1 = n(), Percent = n1 / n) %>%
arrange(Var, desc(n1)) %>% unique()
Related
I have created this function that quickly does some summarization operations (mean, median, geometric mean and arranges them in descending order). This is the function:
summarize_values <- function(tbl, variable){
tbl %>%
summarize(summarized_mean = mean({{variable}}),
summarized_median = median({{variable}}),
geom_mean = exp(mean(log({{variable}}))),
n = n()) %>%
arrange(desc(n))
}
I can do this and it works:
summarize_values(data, lifeExp)
However, I would like to be able to do this:
data %>%
select(year, lifeExp) %>%
summarize_values()
or something like this
data %>%
summarize_values(year, lifeExp)
What am I missing to make this work?
thanks
With the pipe, we don't need to specify the first argument, which is the tbl:
library(dplyr)
data %>%
summarize_values(lifeExp)
A reproducible example:
> mtcars %>%
summarize_values(gear)
summarized_mean summarized_median geom_mean n
1 3.6875 4 3.619405 32
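If the goal is the select-then-summarize form from the question, one possible sketch is a variant that takes no column argument and summarizes every column it receives with across(). The name summarize_all_values is hypothetical, and this assumes dplyr 1.0 or later and that all incoming columns are numeric and positive (the geometric mean needs the log).

```r
library(dplyr)

# Hypothetical variant: summarize every column that is piped in.
summarize_all_values <- function(tbl) {
  tbl %>%
    summarize(
      across(
        everything(),
        list(
          mean = mean,
          median = median,
          geom_mean = ~ exp(mean(log(.x)))
        )
      ),
      n = n()
    )
}

res <- mtcars %>%
  select(gear, mpg) %>%
  summarize_all_values()

res
```

The result has one column per input column and statistic (gear_mean, gear_median, gear_geom_mean, and so on), plus n.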
I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...
Similar in syntax to desired pseudo-code:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
This works by starting the piping sequence with a dot (.). In effect it is close to wrapping the steps in a function; magrittr calls the result a functional sequence. See ?functions and try magrittr::functions(subchain).
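To make the "close to function wrapping" point concrete, here is a small sketch on mtcars (the filter and select steps are arbitrary examples, not from the question). It checks that the functional sequence is an ordinary function and can be called directly as well as used in a pipe.

```r
library(dplyr)
library(magrittr)

# A reusable sub-chain: starts with a dot instead of a dataset.
subchain <- . %>%
  filter(cyl == 4) %>%
  select(mpg, cyl)

is.function(subchain)                 # a functional sequence is a function
length(magrittr::functions(subchain)) # one entry per step in the chain

# Calling it directly and piping into it give the same result.
direct <- subchain(mtcars)
piped  <- mtcars %>% subchain()
identical(direct, piped)
```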
I have a data set that lists bad-buy cars by state. I want to find the top 10 states by percentage of bad buys per state.
The code that isn't working is:
carDF2 = carDF %>% filter(!is.na(IsBadBuy)) %>% group_by(VNST) %>%
mutate(PBadBuy = round(IsBadBuy/sum(IsBadBuy), 3))
Data Table:
Something like this perhaps? I have used the mtcars dataset. Replace the variables with your own.
You would add your filter as well
mtcars
mt2 <- mtcars %>%
filter(!is.na(gear)) %>%
select(cyl, gear) %>% #select the columns you want
group_by(cyl) %>% # for you it is VNST
count %>% #because it isn't an integer
ungroup %>% # so that values aren't 1
mutate(prop_n = n/sum(n))
mt2
This will give you proportions.
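Note that the proportions above are each group's share of all rows. If what you want is the bad-buy rate within each state, a sketch along the following lines might be closer. It assumes IsBadBuy is coded 0/1, so its mean is the share of bad buys; the small carDF below is made-up stand-in data, since the original table is not shown.

```r
library(dplyr)

# Made-up stand-in for the real carDF (assumption: IsBadBuy is 0/1).
carDF <- data.frame(
  VNST     = c("TX", "TX", "FL", "FL", "FL", "CA"),
  IsBadBuy = c(1, 0, 1, 1, 0, 0)
)

top_states <- carDF %>%
  filter(!is.na(IsBadBuy)) %>%
  group_by(VNST) %>%
  summarise(PBadBuy = round(mean(IsBadBuy), 3), n = n()) %>%
  slice_max(PBadBuy, n = 10)   # ten highest rates, sorted descending

top_states
```

The key change from the question's code is summarise() instead of mutate(): it collapses each state to one row before ranking. slice_max() needs dplyr 1.0 or later.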
Let's say I want to split mtcars into 3 csv files based on their cyl grouping. I can use mutate to do this, but it will create a NULL column in the output.
library(tidyverse)
by_cyl = mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
mutate(unused = map2(data, cyl, function(x, y) write.csv(x, paste0(y, '.csv'))))
Is there a way to do this on the by_cyl object without calling mutate?
Here is an option using purrr without mutate from dplyr.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update
This drops the cyl column before saving the output.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
map(~ .x %>% select(-cyl)) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update2
library(tidyverse)
by_cyl <- mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x[["data"]][[1]], paste0(.y, '.csv')))
Here's a solution with do and group_by, so if your data is already grouped as it should be, you save one line:
mtcars %>%
group_by(cyl) %>%
do(data.frame(write.csv(.,paste0(.$cyl[1],".csv"))))
data.frame is only used here because do needs to return a data.frame, so it's a little hack.
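In more recent dplyr versions (0.8 and later), group_walk() looks like a cleaner fit for this kind of side effect, since it does not need the data.frame() workaround. A minimal sketch, writing to a temporary directory so it does not clutter the working directory (out_dir is just an illustrative name):

```r
library(dplyr)

out_dir <- tempdir()

# .x is the data for one group (without cyl), .y is a one-row tibble of keys.
mtcars %>%
  group_by(cyl) %>%
  group_walk(~ write.csv(
    .x,
    file.path(out_dir, paste0(.y$cyl, ".csv")),
    row.names = FALSE
  ))

list.files(out_dir, pattern = "\\.csv$")
```

By default the grouping column is dropped from .x; pass .keep = TRUE to group_walk() if you want cyl written to each file.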
How can I make several sequential manipulations of the same variable using dplyr, but more elegantly than the code below?
Specifically, I would like to remove the multiple calls to car_names = without having to nest any of the functions.
mtcars2 <- mtcars %>% mutate(car_names = row.names(.)) %>%
mutate(car_names=stri_extract_first_words(car_names)) %>%
mutate(car_names=as.factor(car_names))
If you want to type less and not nest the functions, you can use the pipe inside the mutate call:
library(dplyr)
library(stringi)
# What you did
mtcars2 <- mtcars %>%
mutate(car_names = row.names(.)) %>%
mutate(car_names = stri_extract_first_words(car_names)) %>%
mutate(car_names = as.factor(car_names))
# Another way with less typing and no nesting
mtcars3 <- mtcars %>%
mutate(car_names = rownames(.) %>%
stri_extract_first_words(.) %>%
as.factor(.))
identical(mtcars2, mtcars3)
[1] TRUE