I have data on 30 people that includes ethnicity, gender, school type, whether they received free school meals, etc.
I want to produce frequency counts for all of these features. Currently my code looks like this:
df <- read.csv("~file")
df %>% select(Ethnicity) %>% group_by(Ethnicity) %>% summarise(freq = n())
df %>% select(Gender) %>% group_by(Gender) %>% summarise(freq = n())
df %>% select(School.type) %>% group_by(School.type) %>% summarise(freq = n())
Is there a way I can create a frequency tibble for 8 columns (e.g. ethnicity, gender, school type, etc.) in a more efficient way (e.g. 1 or 2 lines of code)?
As an example output for the ethnicity code:
# A tibble: 13 × 2
Ethnicity freq
<chr> <int>
1 Asian or Asian British - Bangladeshi 1
2 Asian or Asian British - Indian 7
3 Asian or Asian British - Pakistani 1
4 Black or Black British - African 5
5 Black or Black British - Caribbean 2
6 Chinese 3
7 Mixed - White and Asian 2
8 Mixed - White and Black African 1
9 Mixed - White and Black Caribbean 1
10 Not known/ prefer not to say 1
11 White British 27
12 White Irish 1
13 White Other 5
And for gender:
# A tibble: 2 × 2
Gender freq
<chr> <int>
1 Female 36
2 Male 21
NB: some columns also contain data on postcode & name which I obviously don't want to perform the frequency function on, so I think I'll somehow need to select just the columns I want to perform this function on
One option would be to use lapply to loop over a vector of your desired columns and dplyr::count for the frequency table.
Using the starwars dataset as example data:
library(dplyr, warn = FALSE)
cols <- c("hair_color", "sex")
lapply(cols, function(x) {
count(starwars, .data[[x]], name = "freq")
})
#> [[1]]
#> # A tibble: 13 × 2
#> hair_color freq
#> <chr> <int>
#> 1 auburn 1
#> 2 auburn, grey 1
#> 3 auburn, white 1
#> 4 black 13
#> 5 blond 3
#> 6 blonde 1
#> 7 brown 18
#> 8 brown, grey 1
#> 9 grey 1
#> 10 none 37
#> 11 unknown 1
#> 12 white 4
#> 13 <NA> 5
#>
#> [[2]]
#> # A tibble: 5 × 2
#> sex freq
#> <chr> <int>
#> 1 female 16
#> 2 hermaphroditic 1
#> 3 male 60
#> 4 none 6
#> 5 <NA> 4
Related
starwars %>%
group_by(species,sex) %>%
summarise() %>%
select(unique.species=species, unique.sex=sex)
How to get unique values from 2 columns("species","sex") all together? I wrote the code above but i'm not sure it's right. Thank you
library(tidyverse)
starwars |>
select(species, sex) |>
distinct()
#> # A tibble: 41 × 2
#> species sex
#> <chr> <chr>
#> 1 Human male
#> 2 Droid none
#> 3 Human female
#> 4 Wookiee male
#> 5 Rodian male
#> 6 Hutt hermaphroditic
#> 7 Yoda's species male
#> 8 Trandoshan male
#> 9 Mon Calamari male
#> 10 Ewok male
#> # … with 31 more rows
Created on 2022-04-25 by the reprex package (v2.0.1)
library(tidyverse)
starwars %>%
expand(nesting(species, sex))
#> # A tibble: 41 × 2
#> species sex
#> <chr> <chr>
#> 1 Aleena male
#> 2 Besalisk male
#> 3 Cerean male
#> 4 Chagrian male
#> 5 Clawdite female
#> 6 Droid none
#> 7 Dug male
#> 8 Ewok male
#> 9 Geonosian male
#> 10 Gungan male
#> # … with 31 more rows
Created on 2022-04-25 by the reprex package (v2.0.1)
There are multiple options. You can use the following code:
unique(starwars[c("species", "sex")])
Output:
species sex
<chr> <chr>
1 Human male
2 Droid none
3 Human female
4 Wookiee male
5 Rodian male
6 Hutt hermaphroditic
7 Yoda's species male
8 Trandoshan male
9 Mon Calamari male
10 Ewok male
# … with 31 more rows
I have data in the following format:
ID
Age
Sex
1
29
M
2
32
F
3
18
F
4
89
M
5
45
M
and;
ID
subID
Type
Status
Year
1
3
Car
Y
1
11
Toyota
NULL
2011
1
23
Kia
NULL
2009
2
5
Car
N
3
2
Car
Y
3
4
Honda
NULL
2019
3
7
Fiat
NULL
2006
3
8
Mitsubishi
NULL
2020
4
1
Car
N
5
7
Car
Y
Each ID in the second table has a row specifying if they have a car, and additional rows stating the brand of car/s they own. Each person has a maximum of 3 cars. I want to simplify this data into a single table as so.
ID
Age
Sex
Car?
Car.1
Car1.year
Car.2
Car2.year
Car.3
Car3.year
1
29
M
Y
Toyota
2011
Kia
2009
NULL
NULL
2
32
F
N
NULL
NULL
NULL
NULL
NULL
NULL
3
18
F
Y
Honda
2019
Fiat
2006
Mitsubishi
2020
4
89
M
N
NULL
NULL
NULL
NULL
NULL
NULL
5
45
M
Y
NULL
NULL
NULL
NULL
NULL
NULL
I've tried using the mutate function in dplyr with the case_when function, but I can't check conditions in another dataframe. If I try to join the tables together, I would have multiple rows for each ID which I want to avoid. The non-standard set up of the second table makes things complicated. My only remaining idea is to switch to Python/Pandas and create a for loop that slowly loops through each ID, searches the second dataframe if the person has a car and the car brands, then mutates a column in the first dataframe. But given the size of my dataset, this would be inefficient and take a long time.
What is the best way to do this?
You can try the following codes:
library(tidyverse)
df1
# A tibble: 5 x 3
ID Age Sex
<dbl> <dbl> <chr>
1 1 29 M
2 2 32 F
3 3 18 F
4 4 89 M
5 5 45 M
df2
# A tibble: 10 x 5
ID subID Type Status Year
<dbl> <dbl> <chr> <chr> <dbl>
1 1 3 Car Y NA
2 1 11 Toyota Y 2011
3 1 23 Kia Y 2009
4 2 5 Car N NA
5 3 2 Car Y NA
6 3 4 Honda Y 2019
7 3 7 Fiat Y 2006
8 3 8 Mitsubishi Y 2020
9 4 1 Clothed N NA
10 5 7 Clothed Y NA
df2 <- df2 %>% mutate(Status = if_else(Status == "NULL", "Y", Status))
df3 <- df2 %>% filter(!is.na(Year)) %>% group_by(ID) %>% mutate(index = row_number())
df4 <- df3 %>% pivot_wider(id_cols = c(ID), values_from = c(Type, Year), names_from = index )
So your desired output will be produced:
df1 %>% left_join(df2 %>% select(ID, Status) %>% distinct()) %>% left_join(df4)
# A tibble: 5 x 10
ID Age Sex Status Type_1 Type_2 Type_3 Year_1 Year_2 Year_3
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 29 M Y Toyota Kia NA 2011 2009 NA
2 2 32 F N NA NA NA NA NA NA
3 3 18 F Y Honda Fiat Mitsubishi 2019 2006 2020
4 4 89 M N NA NA NA NA NA NA
5 5 45 M Y NA NA NA NA NA NA
I am looking for an easy, concise way to use dplyr::select without rearranging columns.
Consider this dataset:
library(tidyverse)
head(msleep)
#> # A tibble: 6 × 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
#> 2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
#> 3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
#> 4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
#> 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
#> 6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
#> # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
If I select vore, genus and name, the resulting dataframe is arranged in the order in which the columns were provided.
msleep %>% select(vore, genus, name)
#> # A tibble: 83 × 3
#> vore genus name
#> <chr> <chr> <chr>
#> 1 carni Acinonyx Cheetah
#> 2 omni Aotus Owl monkey
#> 3 herbi Aplodontia Mountain beaver
#> 4 omni Blarina Greater short-tailed shrew
#> 5 herbi Bos Cow
#> 6 herbi Bradypus Three-toed sloth
#> 7 carni Callorhinus Northern fur seal
#> 8 <NA> Calomys Vesper mouse
#> 9 carni Canis Dog
#> 10 herbi Capreolus Roe deer
#> # … with 73 more rows
I would instead like to leave them in their default order: name, genus, then vore.
I have a solution (see below), but I do not like it because it is quite wordy, and not completely “tidyverse-esque”.
(I am teaching an intro to tidyverse course, and would like something that would not intimidate beginners.)
msleep %>%
select(all_of(names(msleep)[names(msleep) %in% c("vore", "genus", "name")]))
#> # A tibble: 83 × 3
#> name genus vore
#> <chr> <chr> <chr>
#> 1 Cheetah Acinonyx carni
#> 2 Owl monkey Aotus omni
#> 3 Mountain beaver Aplodontia herbi
#> 4 Greater short-tailed shrew Blarina omni
#> 5 Cow Bos herbi
#> 6 Three-toed sloth Bradypus herbi
#> 7 Northern fur seal Callorhinus carni
#> 8 Vesper mouse Calomys <NA>
#> 9 Dog Canis carni
#> 10 Roe deer Capreolus herbi
#> # … with 73 more rows
Is there such a thing? Thank you!
For context: In reality, we have a data frame with about 400 columns, from which we are selecting ~10-20 at a time to work with. The order of the columns in the original data frame is meaningful, but we don't want to have to labor over listing them in their correct order in the select statements. A very specific need, I'll admit.
Created on 2021-12-22 by the reprex package (v2.0.1)
We could use match with sort
library(dplyr)
msleep %>%
select(sort(match(c("vore", "genus", "name"), names(.))))
EDIT: Based on the OP's comments
Update:
In case of providing a vector we could do as akrun suggests in the comments:
nm1 <- c("vore", "genus", "name"); pattern <- str_c(nm1, collapse="|")
Original answer:
You could first define a string with the search items
and then use matches
pattern <- c("vore|genus|name")
select(msleep, matches(pattern))
name genus vore
<chr> <chr> <chr>
1 Cheetah Acinonyx carni
2 Owl monkey Aotus omni
3 Mountain beaver Aplodontia herbi
4 Greater short-tailed shrew Blarina omni
5 Cow Bos herbi
6 Three-toed sloth Bradypus herbi
7 Northern fur seal Callorhinus carni
8 Vesper mouse Calomys NA
9 Dog Canis carni
10 Roe deer Capreolus herbi
You can use the power of eval_select() to create a function to select and sort the columns.
library(dplyr)
select_in_order <- function(data, ...) {
ordered_cols <- sort(tidyselect::eval_select(expr(c(...)), data))
select(data, ordered_cols)
}
So now this will do what you are asking. The benefit is that it will be "full feature" to what you are used to being able to enter into a select() statement.
# library(ggplot2) # msleep is in ggplot2
msleep %>%
select_in_order(vore, genus, name)
# this will work as well
msleep %>%
select_in_order(starts_with("sleep"), vore, name:genus)
EDIT
As another option, simply use relocate() after your select() statement. This alternative approach accomplishes your end goal of keeping the columns in order in a way that is easy to understand by a beginner.
msleep %>%
select(vore, genus, name) %>%
relocate(any_of(names(msleep)))
I recently started to write my own functions to speed up standard and repetitive task while analyzing data with R.
At the moment I'm working on a function with three arguments and ran into a challenge I could not solve yet. I would like to have an optional grouping argument. During the process the function should check if there is a grouping argument and then continue using either subfunction 1 or 2.
But I always get the error "Object not found" if the grouping argument is not NA. How can I do this?
Edit: In my case the filter usually is used to filter certain valid or invalid years. If there is a grouping argument there will follow more steps in the pipe than if there is none.
require(tidyverse)
Data <- mpg
userfunction <- function(DF,Filter,Group) {
without_group <- function(DF) {
DF %>%
count(year)
}
with_group <- function(DF) {
DF %>%
group_by({{Group}}) %>%
count(year) %>%
pivot_wider(names_from=year, values_from=n) %>%
ungroup() %>%
mutate(across(.cols=2:ncol(.),.fns=~replace_na(.x, 0))) %>%
mutate(Mittelwert=round(rowMeans(.[,2:ncol(.)],na.rm=TRUE),2))
}
Obj <- DF %>%
ungroup() %>%
{if(Filter!=FALSE) filter(.,eval(rlang::parse_expr(Filter))) else filter(.,.$year==.$year)} %>%
{if(is.na(Group)) without_group(.) else with_group(.)}
return(Obj)
}
For NA it already works:
> Data %>%
+ userfunction(FALSE,NA)
# A tibble: 2 x 2
year n
<int> <int>
1 1999 117
2 2008 117
With argument it does not work:
> Data %>%
+ userfunction(FALSE,manufacturer)
Error in DF %>% ungroup() %>% { : object 'manufacturer' not found
Edit:
What I would expect from the above function would be the following output:
> Data %>% userfunction_exp(FALSE,manufacturer)
# A tibble: 15 x 4
manufacturer `1999` `2008` Mittelwert
<chr> <dbl> <dbl> <dbl>
1 audi 9 9 9
2 chevrolet 7 12 9.5
3 dodge 16 21 18.5
4 ford 15 10 12.5
5 honda 5 4 4.5
6 hyundai 6 8 7
7 jeep 2 6 4
8 land rover 2 2 2
9 lincoln 2 1 1.5
10 mercury 2 2 2
11 nissan 6 7 6.5
12 pontiac 3 2 2.5
13 subaru 6 8 7
14 toyota 20 14 17
15 volkswagen 16 11 13.5
Data %>% userfunction_exp("cyl==4",manufacturer)
# A tibble: 9 x 4
manufacturer `1999` `2008` mean
<chr> <dbl> <dbl> <dbl>
1 audi 4 4 4
2 chevrolet 1 1 1
3 dodge 1 0 0.5
4 honda 5 4 4.5
5 hyundai 4 4 4
6 nissan 2 2 2
7 subaru 6 8 7
8 toyota 11 7 9
9 volkswagen 11 6 8.5
2021-04-01 14:55: edited to add some information and add some steps to the pipe for function with_group.
Hi this is a good question!
There are multiple ways to achieve this as the previous answers pointed out. One way to do it in the tidyverse is tidy evaluation
Omitting your filter function (which you could explain in more detail...)
my_summary <- function(df, grouping_var) {
grp_var <- enquo(grouping_var) #capture group variable
df %>% my_group_by(grp_var)
}
my_group_by <- function(df, grouping_var){
# Check if group is supplied
if(rlang::quo_is_missing(grouping_var)) {
df %>% without_group()
} else {
df %>% with_group(grouping_var)
}
}
without_group <- function(df) {
# do whatever without group
df %>%
count(year)
}
with_group <- function(df, grouping_var) {
# do whatever with group
df %>%
group_by(!!grouping_var) %>% #Note the !!
count(year) %>%
pivot_wider(names_from=year, values_from=n)
}
Which will give you without any argument
> mpg %>% my_summary()
# A tibble: 2 x 2
year n
<int> <int>
1 1999 117
2 2008 117
With group passed to pipe
> mpg %>% my_summary(model)
# A tibble: 38 x 3
# Groups: model [38]
model `1999` `2008`
<chr> <int> <int>
1 4runner 4wd 4 2
2 a4 4 3
3 a4 quattro 4 4
4 a6 quattro 1 2
5 altima 2 4
6 c1500 suburban 2wd 1 4
7 camry 4 3
8 camry solara 4 3
9 caravan 2wd 6 5
10 civic 5 4
# ... with 28 more rows
I don't know what is the use of Filter argument so I'll keep it as it is for now.
group_by(A) %>% count(B) is same as count(A, B) so you can change your function to :
library(tidyverse)
userfunction <- function(DF,Filter,Group = NULL) {
DF %>%
count(year, {{Group}}) %>%
pivot_wider(names_from=year, values_from=n)
}
Data %>% userfunction(FALSE)
# `1999` `2008`
# <int> <int>
#1 117 117
Data %>% userfunction(FALSE,manufacturer)
# A tibble: 15 x 3
# manufacturer `1999` `2008`
# <chr> <int> <int>
# 1 audi 9 9
# 2 chevrolet 7 12
# 3 dodge 16 21
# 4 ford 15 10
# 5 honda 5 4
# 6 hyundai 6 8
# 7 jeep 2 6
# 8 land rover 2 2
# 9 lincoln 2 1
#10 mercury 2 2
#11 nissan 6 7
#12 pontiac 3 2
#13 subaru 6 8
#14 toyota 20 14
#15 volkswagen 16 11
Note that I have assigned the default value to Group as NULL so when you don't mention anything it ignores that argument.
I believe this may have a simple solution but I'm having trouble describing what I need to do (and hence what to search for). I think I need the summarize function. My goal output is at the very bottom.
I'm trying to count the occurrences of a value between each unique value in another column. Here is an example df that hopefully illustrates what I need todo.
library(dplyr)
set.seed(1)
df <- tibble("name" = c(rep("dinah",2),rep("lucy",4),rep("sora",9)),
"meal" = c(rep(c("chicken","beef","fish"),5)),
"date" = seq(as.Date("1999/1/1"),as.Date("2000/1/1"),25),
"num.wins" = sample(0:30)[1:15])
Among other things, I'm trying to summarize (sum) the types of meals each name had using this data.
df
# A tibble: 15 x 4
name meal date num.wins
<chr> <chr> <date> <int>
1 dinah chicken 1999-01-01 8
2 dinah beef 1999-01-26 11
3 lucy fish 1999-02-20 16
4 lucy chicken 1999-03-17 25
5 lucy beef 1999-04-11 5
6 lucy fish 1999-05-06 23
7 sora chicken 1999-05-31 27
8 sora beef 1999-06-25 15
9 sora fish 1999-07-20 14
10 sora chicken 1999-08-14 1
11 sora beef 1999-09-08 4
12 sora fish 1999-10-03 3
13 sora chicken 1999-10-28 13
14 sora beef 1999-11-22 6
15 sora fish 1999-12-17 18
I've made progress with other calculations I'm interested in, below:
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins))
# A tibble: 3 x 5
name count medianDate life wins
<chr> <int> <date> <time> <int>
1 dinah 2 1999-01-13 25 days 19
2 lucy 4 1999-03-29 75 days 69
3 sora 9 1999-09-08 200 days 101
My goal is to add an additional column for each type of food, and have the sum of the occurrences of that food displayed in each row, like so:
name count medianDate life wins chicken beef fish
1 dinah 2 1999-01-13 25 days 19 1 1 0
2 lucy 4 1999-03-29 75 days 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
Though older, and possibly on a deprecation path, reshape2::dcast does this nicely:
reshape2::dcast(df, name ~ meal)
# name beef chicken fish
# 1 dinah 1 1 0
# 2 lucy 1 1 2
# 3 sora 3 3 3
You can understand the formula as rows ~ columns. By default, it will aggregate the values in the columns using the length function---which gives exactly what you want, the count of each.
This can be easily joined to your summary data:
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins)) %>%
left_join(reshape2::dcast(df, name ~ meal))
# # A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <int> <int> <int>
# 1 dinah 2 1999-01-13 25 days 19 1 1 0
# 2 lucy 4 1999-03-29 75 days 69 1 1 2
# 3 sora 9 1999-09-08 200 days 101 3 3 3
One option is to use table inside summarise as a list column, unnest and then spread it to 'wide' format
library(tidyverse)
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins),
n = list(enframe(table(meal))) ) %>%
unnest %>%
spread(name1, value, fill = 0)
# A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <dbl> <dbl> <dbl>
#1 dinah 2 1999-01-13 25 days 19 1 1 0
#2 lucy 4 1999-03-29 75 days 69 1 1 2
#3 sora 9 1999-09-08 200 days 101 3 3 3
I'm not entirely sure why I'm getting the funky formatting for life, but I think this gets at your need for a count of the meal types.
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins),
chicken = sum(meal == "chicken"),
beef = sum(meal == "beef"),
fish = sum(meal == "fish"))
# A tibble: 3 x 8
name count medianDate life wins chicken beef fish
<chr> <int> <date> <time> <int> <int> <int> <int>
1 dinah 2 1999-01-13 " 25 days" 19 1 1 0
2 lucy 4 1999-03-29 " 75 days" 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3