I'm working with some survey data, and I want to summarize responses from everyone and responses from members in a single table.
The best way I can think of to translate this to the starwars dataset is: how many characters in total have a given eye color, and how many female characters have that eye color? For simplicity, I limited the population to blue and brown eyes.
I can run two separate queries, one to show just the females:
starwars %>%
  filter(eye_color %in% c("brown", "blue")) %>%
  count(eye_color, gender) %>%
  filter(gender == "female") %>%
  mutate(percent = n / sum(n) * 100,
         percent = sprintf("%.0f%%", percent))
And one to show all characters regardless of gender:
starwars %>%
  filter(eye_color %in% c("brown", "blue")) %>%
  count(eye_color) %>%
  mutate(percent = n / sum(n) * 100,
         percent = sprintf("%.0f%%", percent))
But I'd like to spit these out as a single table. Is there a better approach to that than just pasting the two resulting tibbles together?
I still don't know of a good way in dplyr to group data where the groups overlap without repeating rows, so I think combining the data from two different pipelines is fine. If you want to eliminate code duplication, you could write a helper function. Here's one such example:
library(dplyr)
library(purrr)

plus_margin <- function(data, filters, fun = identity, .id = "id") {
  stopifnot(is.list(filters))
  stopifnot(!is.null(names(filters)))
  stopifnot(all(sapply(filters, is.function)))
  stopifnot(is.function(fun))
  bind_rows(
    # apply `fun` to each filtered subset, labelling rows with the filter's name
    map_dfr(filters, ~ data %>% .x %>% fun, .id = .id),
    # then apply `fun` to the full data and label those rows "all"
    data %>% fun %>% mutate(!!.id := "all")
  )
}
Then you could call it with something like
starwars %>%
  filter(eye_color %in% c("brown", "blue")) %>%
  plus_margin(
    list(
      feminine = . %>% filter(gender == "feminine")
    ),
    . %>% count(eye_color) %>%
      mutate(percent = n / sum(n) * 100,
             percent = sprintf("%.0f%%", percent))
  )
Which returns
id eye_color n percent
<chr> <chr> <int> <chr>
1 feminine blue 6 55%
2 feminine brown 5 45%
3 all blue 19 48%
4 all brown 21 52%
The idea is that you pass in a list of filters to subset the data by. These filters should be functions that take the data and subset it in some way. The list should be named, and the names will be used as values in the resulting "id" column. Here we use the magrittr syntax . %>% ... to create an anonymous function. We then need to pass in a function to apply to each of the subsets.
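For comparison, here is a minimal sketch of getting the same table without the helper, just labelling and stacking the two pipelines with bind_rows (eye_summary() below is a hypothetical name for the shared summarising step):

library(dplyr)

# hypothetical helper: the summary both subsets share
eye_summary <- function(data) {
  data %>%
    filter(eye_color %in% c("brown", "blue")) %>%
    count(eye_color) %>%
    mutate(percent = sprintf("%.0f%%", n / sum(n) * 100))
}

bind_rows(
  feminine = starwars %>% filter(gender == "feminine") %>% eye_summary(),
  all      = starwars %>% eye_summary(),
  .id = "id"
)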
But at the end of the day, the joining is still happening with bind_rows. Maybe someone else will suggest a better way.
I have a fairly large dataset and I am running a for loop that removes one line per transect and calculates the frequency of each category. I am now trying to make it so that instead of one line per transect it removes a whole transect every iteration. Is it possible to do this?
Here is a sample dataset with the same columns I have
Transect<- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
Category<- c("S","S","S","C","T","S","SP","T", "C", "T","S","SP","T","S","C")
dat<- data.frame(Transect,Category)
So the current code below removes one line per transect. How could I do it so that it randomly deletes a whole transect every iteration (i.e. in the first iteration all of transect 3 is removed, and in the second all of transect 1 is removed)?
for (q in 1:2) {
  for (i in 0:5) {
    df <- dat %>%
      group_by(Transect) %>%
      sample_n(n() - i, replace = TRUE) %>%
      ungroup()
    c <- df %>%
      group_by(Category) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n),
             total = 55 - i)
    if (i == 0) {
      tot_1 = c
    } else {
      tot_1 = bind_rows(tot_1, c)
    }
  }
  tot_1$rep = q
  if (q == 1) {
    dftot = tot_1
  } else {
    dftot = bind_rows(dftot, tot_1)
  }
}
It seems your goal is to iteratively assess increasingly small subsamples of your data, to gauge how much representation of the whole is lost. This code tries dropping a random 1, then 2, then 3... transects and reports the distribution of categories. The last few lines normalize counts to fractions of the total for easy comparison between iterations.
Note that I used set.seed(), because the result would otherwise differ on each run due to the random sampling.
To break down this answer a bit:
It's important that Category is a factor so that table() won't drop any Category values that have no count in a particular iteration. Otherwise the code would run up to a point, but the row-binding inside map_dfr() would then fail (see the short illustration after this breakdown).
First I just enumerate the numbers of Transect to leave out (0:3 in this example) using 0:length(unique(d$Transect)). I included 0 so that we can see what it looks like with the full dataset.
I used set_names() so that it becomes a named vector. This allows us to use .id inside map_dfr() so that we get an extra column which stores the value of the leaveout.
purrr::map_dfr() iteratively applies a function over a list or vector. In this case I piped in the vector of leaveout values (which we just named), and the function to apply is given as an rlang-style lambda function, which begins with ~ and operates on the argument .x.
Working from the inside of the filter() call outward, the lambda first randomly samples .x values of Transect to exclude and then removes rows with those values of Transect. Here we use %in% and negate the whole result with ! at the beginning.
Then we just use dplyr::pull() to take the Category column as a vector and run table() on it to tabulate the occurrence of each value.
The rest just calculates the total count for each iteration and then divides the values by that to get a fraction.
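To illustrate the factor point (a minimal sketch, not part of the answer's pipeline): with a factor, table() keeps zero counts for levels that are absent from a subsample, so every iteration produces the same set of columns.

x <- factor(c("S", "S", "C"), levels = c("C", "S", "SP", "T"))
table(x)                # C S SP T -> 1 2 0 0: all levels are kept
table(as.character(x))  # only C and S appear, so columns would differ between iterations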
library(tidyverse)

d <- tibble(
  Transect = as.character(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)),
  Category = factor(c("S","S","S","C","T","S","SP","T","C","T","S","SP","T","S","C"))
)

set.seed(1)
0:length(unique(d$Transect)) %>%
  set_names() %>%
  map_dfr(~ d %>%
            filter(!Transect %in% sample(unique(d$Transect), size = .x)) %>%
            pull(Category) %>%
            table(),
          .id = "leaveout_transects") %>%
  rowwise() %>%
  mutate(total_count = sum(c_across(-1)), .after = 1) %>%
  mutate(across(-c(1:2), ~ .x / total_count))
#> # A tibble: 4 × 6
#> # Rowwise:
#> leaveout_transects total_count C S SP T
#> <chr> <int> <table> <table> <table> <table>
#> 1 0 15 0.2 0.4 0.1333333 0.2666667
#> 2 1 10 0.2 0.3 0.2000000 0.3000000
#> 3 2 5 0.2 0.2 0.2000000 0.4000000
#> 4 3 0 NaN NaN NaN NaN
It would probably be more rigorous to simulate each leaveout condition multiple times and look at the distribution of performance you get at each value to assess what's likely to happen in the future with a given subsample.
Base R has the built-in function replicate(), which is great for this purpose. Here I'm just using the code above with replicate() and then reformatting the data a bit to graph it.
# use replicate to make many simulations
n_reps <- 20
replicate(
n_reps,
0:length(unique(d$Transect)) %>% set_names() %>%
map_dfr(
~ d %>%
filter(!Transect %in% sample(unique(d$Transect), size = .x)) %>%
pull(Category) %>%
table(),
.id = "leaveout_transects"
) %>%
rowwise() %>%
mutate(total_count = sum(c_across(-1)), .after = 1) %>%
mutate(across(-c(1:2), ~ .x / total_count)) %>%
select(3:6) %>%
t() %>%
cor() %>%
.[, 1]) %>%
as_tibble(.name_repair = "unique") %>%
mutate("leavout_transects" = factor(0:length(unique(d$Transect)))) %>%
pivot_longer(-leavout_transects, values_to = "correlation") %>%
select(-name) %>%
ggplot(aes(leavout_transects, correlation)) +
geom_boxplot()
Created on 2022-09-22 by the reprex package (v2.0.1)
I have a long dataset in which there are duplicated entries whose data I need to merge, e.g. paste values together.
In my case, I have a database of scientific articles: the strongest unique identifiers are the DOI and the article title, but the first may be missing in one of the copies, and the second may have slight phonetic/graphic differences that are easy to spot for humans but not programmatically (e.g. one copy uses β and the other plain beta).
A "match" are two articles that share at least one of the two columns. That is, I need a way to dplyr::group_by by the DOI OR the article title (usual group_by uses an AND logic).
The only solution that comes to my mind is to repeat the aggregation twice, for each column. Not very efficient given the large number of records.
Example:
imagine an input like:
df <- data.frame(
ID = c(1, NA, 2, 2),
Title = c('A', 'A', 'beta', 'β'),
to.join = 1:4
)
After (OR)grouping and summarising:
df %>%
group_by_OR(ID, Title) %>% # dummy function
summarise(
ID = na.omit(ID)[1],
Title = Title[1],
joined = paste(to.join, collapse = ', '))
I should get something like this:
ID Title joined
1 1 A 1, 2
2 2 beta 3, 4
That is, the data was grouped by the title for the first group and by the id for the second.
I don't think you can avoid grouping the data twice, but we can do it sequentially, which keeps things as efficient as possible.
library(dplyr)

df_aggregated <- df %>%
  group_by(ID) %>%
  arrange(Title) %>%
  summarise(Title = first(Title),
            to.join = paste0(to.join, collapse = ", ")) %>%
  group_by(Title) %>%
  arrange(ID) %>%
  summarise(ID = first(ID),
            to.join = paste0(to.join, collapse = ", ")) %>%
  select(ID, Title, joined = to.join) %>%
  as.data.frame()
Now,
df_aggregated
is:
ID Title joined
1 1 A 1, 2
2 2 beta 3, 4
Eventually I found a solution, thanks also to @dario.
First I group by Title and impute the missing DOIs when at least one of the copies has one. Then I ungroup and create a new unique ID, using the DOI if present and the Title for those entries where no copy has one.
Finally I group and summarize by this ID.
This way the computational-heavy summarising step is done only once.
records %>%
  mutate(
    # normalize titles to improve matching between slightly different copies
    uID = str_to_lower(Title) %>% str_remove_all('[^\\w\\d]+')
  ) %>%
  group_by(uID) %>%
  mutate(DOI = na.omit(DOI)[1]) %>%
  ungroup() %>%
  mutate(
    uID = ifelse(is.na(DOI), uID, DOI)
  ) %>%
  group_by(uID) %>%
  summarise(...) # various stuff here.
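For reference, here is a rough sketch of the same idea applied to the toy df from the question, with ID playing the role of the DOI; the beta/β rows end up merged through their shared ID rather than through the normalized title:

library(dplyr)
library(stringr)

df <- data.frame(
  ID = c(1, NA, 2, 2),
  Title = c('A', 'A', 'beta', 'β'),
  to.join = 1:4
)

df %>%
  mutate(uID = str_to_lower(Title) %>% str_remove_all('[^\\w\\d]+')) %>%
  group_by(uID) %>%
  mutate(ID = na.omit(ID)[1]) %>%               # impute a missing ID from a same-title duplicate
  ungroup() %>%
  mutate(uID = ifelse(is.na(ID), uID, ID)) %>%  # prefer the ID as the grouping key
  group_by(uID) %>%
  summarise(ID = first(ID),
            Title = first(Title),
            joined = paste(to.join, collapse = ", "),
            .groups = "drop")
# rows 1-2 collapse to ID 1 (joined "1, 2"), rows 3-4 to ID 2 (joined "3, 4")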
I have a task that I can't solve. My goal is to be able to figure out how many "families" have children (under 18). I only need the sum of unique familyids and I've tried doing it in R and Excel and can't figure it out.
In my data I have four families and my data is saved on a client level.
data <- data.frame(
"FamilyID" = c(10,10,10,11,11,11,12,12,13,13),
"ClientID" = c(101,102,103,111,112,113,121,122,131,132),
"Age" = c(26,1,5,35,34,1,54,60,17,21)
)
My goal is to have something like this
Metric Count
Families w/ Children 3
Families w/out Children 1
In my actual dataset I have thousands of families, so I really appreciate any help.
How can I do this with dplyr?
library(tidyverse)
counts <- data %>%
  group_by(FamilyID) %>%
  summarise(number_of_children = sum(Age < 18),
            number_of_adults = sum(Age >= 18)) %>%
  ungroup()

final <- counts %>%
  summarise("Families w/ children" = sum(number_of_children > 0),
            "Families w/o children" = sum(number_of_children < 1)) %>%
  gather() %>%
  rename("Metric" = key, "Count" = value)
You can try something like this:
data2 <- data %>% group_by(FamilyID) %>%
mutate(children=sum(Age<18)) %>% mutate(children=ifelse(children>=1,1,0))
data2 %>% group_by(children) %>% summarize(n_distinct(FamilyID))
which shows how many distinct Family IDs correspond to 0 children, and how many correspond to at least 1 child.
One option is to use any to distinguish which families have children TRUE/FALSE, followed by dplyr::count
library(dplyr)
data %>%
group_by(FamilyID) %>%
summarize(have_children = any(Age < 18)) %>%
count(have_children)
#------
have_children n
<lgl> <int>
1 FALSE 1
2 TRUE 3
I am quite new to R and struggling with subsetting datasets.
This is where the dataset came from and how I clean it.
board_game <- read.csv("https://raw.githubusercontent.com/bryandmartin/STAT302/master/docs/Projects/project1_bgdataviz/board_game_raw.csv")

# tidy up the mechanic and category columns with the cSplit function
library(splitstackshape)
mechanic <- board_game$mechanic
board_game_tidy <- cSplit(board_game, splitCols = c("mechanic", "category"), sep = ",", direction = "long")
Here's my code trying to extract two columns: category and average complexity.
summary_category <- summary(board_game_tidy$category)
top_5_category <- summary_category[1:5]
complexity_top_5_category <- board_game_tidy %>%
  group_by(category) %>%
  select(average_complexity) %>%
  filter(category == c("Abstract Strategy", "Action / Dexterity", "Adventure", "Age of Reason", "American Civil War"))
complexity_top_5_category
My final intent: create a data frame with only 2 columns: category and average complexities, and take a mean of the average complexities under the same category name.
What I encountered: I have 5 rows of category, but 30 rows of average complexities. What can I do to take a mean value of all the average complexities under the same category names? All help will be appreciated! Thank you!
Filter the values for the top 5 categories, then group_by category and take the mean of average_complexity.
library(dplyr)
board_game_tidy %>%
filter(category %in% names(top_5_category)) %>%
group_by(category) %>%
summarise(average_complexity = mean(average_complexity))
# category average_complexity
# <fct> <dbl>
#1 Abstract Strategy 0.844
#2 Action / Dexterity 0.469
#3 Adventure 1.25
#4 Age of Reason 1.95
#5 American Civil War 1.68
You are very close. You need dplyr::summarise()
complexity_top_5_category <- board_game_tidy %>%
group_by(category) %>%
dplyr::summarise(mean_average_complexity = mean(average_complexity, na.rm=TRUE)) %>%
top_n(5, mean_average_complexity)
#select(average_complexity) %>% # you don't need this
#filter(category == c("Abstract Strategy Action / Dexterity", "Adventure", "Age of Reason","American Civil War "))
complexity_top_5_category
You don't have to include dplyr:: before summarise(). However, some other common packages have their versions of summarise() so it's safer to be specific.
You can use top_n() to automatically select the top n categories, instead of using filter().
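As a side note, in dplyr 1.0.0 and later top_n() is superseded by slice_max(); a sketch of the equivalent call:

# same result as top_n(5, mean_average_complexity), using the newer verb
complexity_top_5_category <- board_game_tidy %>%
  group_by(category) %>%
  dplyr::summarise(mean_average_complexity = mean(average_complexity, na.rm = TRUE)) %>%
  slice_max(mean_average_complexity, n = 5)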
Okay, so I have data like so:
ID Name Job
001 Bill Carpenter
002 Wilma Lawyer
003 Greyson Lawyer
004 Eddie Janitor
I want to group these together for analysis so any job that appears less than x percent of the whole will be grouped into "Other"
How can I do this? Here is what I tried:
df %>%
group_by(Job) %>%
summarize(count = n()) %>%
mutate(pct = count/sum(count)) %>%
arrange(desc(count)) %>%
drop_na()
And now I know what the percentages are, but how do I integrate this into the original data to make everything below X "Other"? (Let's say less than or equal to 25% is "Other".)
Maybe there's a more straightforward way....
You can try this:
library(dplyr)
df %>%
  count(Job) %>%
  mutate(n = n / sum(n)) %>%
  left_join(df, by = 'Job') %>%
  mutate(Job = replace(Job, n <= 0.25, 'Other'))
To integrate our calculation into the original data, we do a left_join and then replace the values.
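If you'd rather avoid the join, a sketch of an alternative using add_count() (this assumes Job is stored as character):

library(dplyr)

df %>%
  add_count(Job) %>%                                        # per-job row count in column n
  mutate(Job = if_else(n / n() <= 0.25, "Other", Job)) %>%  # n() here is the total row count
  select(-n)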