Let's say I want to summarize a certain data frame column:
> starwars %>% count(eye_color)
# A tibble: 15 x 2
eye_color n
<chr> <int>
1 black 10
2 blue 19
3 blue-gray 1
4 brown 21
5 dark 1
6 gold 1
7 green, yellow 1
8 hazel 3
9 orange 8
10 pink 1
11 red 5
12 red, blue 1
13 unknown 3
14 white 1
15 yellow 11
There are a lot of small categories, such as "blue-gray" or "pink". I would like to merge them all into "other". There is a multi-step process to do this:
starwars %>%
add_count(eye_color) %>%
mutate(eye_color = if_else(n < 5, "other", eye_color)) %>%
count(eye_color)
There is also a way to do it with a single command. I saw this trick somewhere before but can't find it now.
Writing up @Jordan's suggestion:
Updated with Camille's fix:
starwars %>% mutate(eye_color_grp = as.factor(eye_color) %>%
forcats::fct_lump_min(min = 5, other_level = "Other")) %>%
count(eye_color_grp, sort = TRUE)
Link: https://forcats.tidyverse.org/reference/fct_lump.html
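For reference, fct_lump_min() lumps by a minimum count; if you would rather keep a fixed number of the most frequent levels, forcats also has fct_lump_n() (a minimal sketch, keeping the 5 most common eye colors):
starwars %>%
  mutate(eye_color_grp = forcats::fct_lump_n(as.factor(eye_color), n = 5, other_level = "Other")) %>%
  count(eye_color_grp, sort = TRUE)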
I'm working with a "movie" dataset. I have a movie "title" column (col no 1) and an "overall_score" column (col no 13).
Apparently multiple movies have scored 10, so when I take the top 10, it only shows me movies with a score of 10.
But I want each score (10, 9, 8, and so on down to 1) to appear only 3 times. I tried using the slice function but wasn't successful; what do you think I'm doing wrong?
Here's my code:
movie2 <- movie_reviews %>%
arrange(desc(Overall)) %>%
group_by(uid, title) %>%
head(10) %>% slice(13:3)
If you don't care which movies end up within the score subgroups, you can just use row_number() to assign a sequence number within each Overall group and keep numbers 1 through 3.
library(dplyr)
set.seed(1)
movie_reviews <- data.frame(
uid = 1:100,
title = paste("title", 1:100),
Overall = sample(1:10, 100, replace=T)
)
movie2 <- movie_reviews %>%
group_by(Overall) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
filter(rn < 4) %>%
select(-rn) %>%
arrange(Overall)
> movie2
# A tibble: 30 × 3
uid title Overall
<int> <chr> <int>
1 4 title 4 1
2 9 title 9 1
3 64 title 64 1
4 23 title 23 2
5 82 title 82 2
6 87 title 87 2
7 8 title 8 3
8 57 title 57 3
9 80 title 80 3
10 27 title 27 4
# … with 20 more rows
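If you're on dplyr 1.0.0 or later, slice_head() expresses the same idea more compactly (a sketch under the same assumption that any three rows per score will do):
library(dplyr)
movie2 <- movie_reviews %>%
  group_by(Overall) %>%
  slice_head(n = 3) %>% # keep the first three rows of each score group
  ungroup() %>%
  arrange(Overall)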
I am trying to deal with some aggregated data. I would like to have the data in a tidy format, but I am not sure how to do this without ending up with a number of value variables. What is the correct way to organize this data? I have searched around but can't find anything.
Here is an example:
#create the dataframe
df <- data.frame('date' = seq(as.Date('2019-01-15'), as.Date('2019-04-15'), 'months'),
'total' = c(2, 4, 1, 6),
'age.0-6' = c(1, 4, 0, 3),
'age.7-12' = c(1, 0, 1, 3),
'race.white' = c(1, 2, 0, 2),
'race.black' = c(1, 2, 1, 2),
'race.other' = c(0, 0, 1, 2))
#print the dataframe (note: data.frame's default check.names turns 'age.0-6' into the syntactic name 'age.0.6')
df
date total age.0.6 age.7.12 race.white race.black race.other
1 2019-01-15 2 1 1 1 1 0
2 2019-02-15 4 4 0 2 2 0
3 2019-03-15 1 0 1 0 1 1
4 2019-04-15 6 3 3 2 2 2
The problem here is that I don't know the individual-level categories, as the data is all aggregated. For example, for April 2019, I don't know whether the races for ages 0-6 are:
2 other and 1 white; or
2 white and 1 black; or
1 black, 1 white and 1 other.
Because of this I can't get unique columns for each variable with one value per outcome, so I can't tidy in the usual way.
Instead, I can pivot age and race into long form, each with its own value column. Renaming the value variable is the easy part; the bigger problem is that I end up with many variables, each dragging along its own value column.
Here is a quick example:
df %>%
pivot_longer(c(age.0.6, age.7.12), names_to = 'age') %>% #pivot the age data
mutate(age = gsub('[a-z]+\\.', '', age)) %>% #clean the age variable
pivot_longer(c(race.white, race.black, race.other), names_to = 'race', values_to = 'count') %>% #pivot the race data (use 'count' instead of 'value')
mutate(race = gsub('[a-z]+\\.', '', race)) #clean the race data
# A tibble: 24 x 6
date total age value race count
<date> <dbl> <chr> <dbl> <chr> <dbl>
1 2019-01-15 2 0.6 1 white 1
2 2019-01-15 2 0.6 1 black 1
3 2019-01-15 2 0.6 1 other 0
4 2019-01-15 2 7.12 1 white 1
5 2019-01-15 2 7.12 1 black 1
6 2019-01-15 2 7.12 1 other 0
7 2019-02-15 4 0.6 4 white 2
8 2019-02-15 4 0.6 4 black 2
9 2019-02-15 4 0.6 4 other 0
10 2019-02-15 4 7.12 0 white 2
# ... with 14 more rows
This is clearly not a tidy format, and the data is pretty unmanageable. The problem rapidly becomes huge when I have a large number of age brackets, a large number of race categories, and a host of other aggregated characteristics: gender, disability, income bracket, and so on.
Any thoughts on the best way to organize data of this sort? I am assuming it is common enough and there is best practice.
I think you have a few options that might make sense, depending on how you want to use the data. For visualizing the data, I think it's enough to just pivot the whole thing longer (#1 below). For analysis within each dimension, it might be safest and least presumptuous to keep them as separate tables (#2), since, as you noted, there are a huge number of ways the dimensions could conceivably relate to each other. If you want to show all the dimensions together, you will need to make assumptions about how they relate. In #3 I assume the dimensions are completely uncorrelated, but in real samples this is rarely the case and may lead to incorrect conclusions (see, e.g., Simpson's paradox).
1. Make dimension a variable in a longer table
Here we just make the dimension of data (total / race / age) one column, and the value another.
library(tidyverse)
long_all <- df %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
This might make sense if you want to go right to visualization, where you could either filter by dimension or assign them to facets:
ggplot(long_all, aes(category, value)) +
geom_col() +
facet_wrap(~dimension, scales = "free_x" )
2. Make into multiple tables
You don't know how the dimensions relate to each other, so one clean method would be to keep them distinct. Then we could analyze each separately with a table focused on that dimension.
race <- df %>%
select(date, contains("race")) %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
age <- df %>%
select(date, contains("age")) %>%
pivot_longer(-date) %>%
separate(name, c("dimension", "category"),
fill = "right", extra = "merge")
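Since the two tables above differ only in the column prefix, a small helper avoids the duplication (dim_table() is a hypothetical name, and this assumes the tidyverse is loaded as above):
dim_table <- function(df, dim) {
  df %>%
    select(date, contains(dim)) %>%       # keep the date plus one dimension's columns
    pivot_longer(-date) %>%
    separate(name, c("dimension", "category"),
             fill = "right", extra = "merge")
}
race <- dim_table(df, "race")
age <- dim_table(df, "age")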
3. Impute hypothetical individuals
If you need to include both dimensions, you will have to make assumptions about how they relate. You might posit, for instance, that race and age are perfectly independent of each other in the sample (this is likely a faulty assumption, so should be noted). To create hypothetical crosstabs this way, you could create hypothetical individuals and have each sample without replacement from the various ages and races. The result will be one possibility of how the original summary data could have arisen, but might well omit patterns that exist in the true underlying data.
set.seed(42)
shuffle_step <- function(df) {
df %>%
uncount(value) %>%
slice_sample(prop = 1, replace = FALSE) %>%
group_by(date) %>%
mutate(row_in_date = row_number()) %>%
ungroup()
}
imputed_individuals <- full_join(
age %>%
shuffle_step %>%
select(date, row_in_date, age = category),
race %>%
shuffle_step %>%
select(date, row_in_date, race = category),
by = c("date", "row_in_date"))
Here, I make a row for each individual within each date with a possible category value, either for race or age. Then we join the two resulting data sets together, giving one possible set of individuals who would produce the same summary stats we started with, assuming the dimensions are uncorrelated.
We see here that one more individual was assigned a race than were counted in the age or total dimensions; they show up with an NA age at the bottom of the list. It's likely a typo in the source data, but such misalignment is common in real-world data collection, so it's good practice to accommodate inconsistent counts.
> imputed_individuals
# A tibble: 14 x 4
date row_in_date age race
<date> <int> <chr> <chr>
1 2019-02-15 1 0.6 black
2 2019-04-15 1 0.6 black
3 2019-01-15 1 0.6 black
4 2019-04-15 2 7.12 black
5 2019-04-15 3 0.6 other
6 2019-02-15 2 0.6 white
7 2019-04-15 4 7.12 white
8 2019-04-15 5 0.6 other
9 2019-01-15 2 7.12 white
10 2019-02-15 3 0.6 white
11 2019-02-15 4 0.6 black
12 2019-03-15 1 7.12 other
13 2019-04-15 6 7.12 white
14 2019-03-15 2 NA black
We can confirm that this hypothetical scenario is consistent with our original data:
long_all %>%
filter(dimension == "age") %>%
left_join(
imputed_individuals %>% count(date, age),
by = c("date", "category" = "age"))
# A tibble: 8 x 5
date dimension category value n
<date> <chr> <chr> <dbl> <int>
1 2019-01-15 age 0.6 1 1
2 2019-01-15 age 7.12 1 1
3 2019-02-15 age 0.6 4 4
4 2019-02-15 age 7.12 0 NA
5 2019-03-15 age 0.6 0 NA
6 2019-03-15 age 7.12 1 1
7 2019-04-15 age 0.6 3 3
8 2019-04-15 age 7.12 3 3
long_all %>%
filter(dimension == "race") %>%
left_join(
imputed_individuals %>% count(date, race),
by = c("date", "category" = "race"))
# A tibble: 12 x 5
date dimension category value n
<date> <chr> <chr> <dbl> <int>
1 2019-01-15 race white 1 1
2 2019-01-15 race black 1 1
3 2019-01-15 race other 0 NA
4 2019-02-15 race white 2 2
5 2019-02-15 race black 2 2
6 2019-02-15 race other 0 NA
7 2019-03-15 race white 0 NA
8 2019-03-15 race black 1 1
9 2019-03-15 race other 1 1
10 2019-04-15 race white 2 2
11 2019-04-15 race black 2 2
12 2019-04-15 race other 2 2
Let's say I have the following table of houses (or anything) and their colors:
ID Group Color
1 a Green
2 a Green
3 a Orange
4 b Blue
5 b Yellow
6 b Blue
I'm trying to:
group_by(Group)
count rows (I assume with length(unique(ID))),
mutate or summarize into a new column with the count of each color in the group, as a string.
Result should be:
ID Group Color Summary
1 a Green 3 houses, 2 Green, 1 Orange
2 a Green 3 houses, 2 Green, 1 Orange
3 a Orange 3 houses, 2 Green, 1 Orange
4 b Blue 3 houses, 2 Blue, 1 Yellow
5 b Yellow 3 houses, 2 Blue, 1 Yellow
6 b Blue 3 houses, 2 Blue, 1 Yellow
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?
Here is one approach: count the frequency of each 'Group'/'Color' combination with add_count, unite that count with 'Color' into 'nColor', then, grouped by 'Group', create the 'Summary' column by concatenating the group size (n()) with the unique elements of 'nColor'.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# A tibble: 6 x 5
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))
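If one summary row per group is enough, rather than repeating Summary on every house, a count()-based variant (a sketch, not part of the answer above) gets there without unite():
library(dplyr)
library(stringr)
df %>%
  count(Group, Color) %>% # frequency of each color within each group
  group_by(Group) %>%
  summarise(Summary = str_c(sum(n), " houses, ", toString(str_c(n, " ", Color))))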
Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow
I have a list of data frames:
df1 <- data.frame(one = c('red','blue','green','red','red','blue','green','green'),
one.1 = as.numeric(c('1','1','0','1','1','0','0','0')))
df2 <- data.frame(two = c('red','yellow','green','yellow','green','blue','blue','red'),
two.2 = as.numeric(c('0','1','1','0','0','0','1','1')))
df3 <- data.frame(three = c('yellow','yellow','green','green','green','white','blue','white'),
three.3 = as.numeric(c('1','0','0','1','1','0','0','1')))
all <- list(df1,df2,df3)
I need to group each data frame by the first column and summarise the second column.
Individually I would do something like this:
library(dplyr)
df1 <- df1 %>%
group_by(one) %>%
summarise(sum = sum(one.1))
However I'm having trouble figuring out how to iterate over each item in the list.
I've thought of using a loop:
for(i in 1:3){
all[i] <- all[i] %>%
group_by_at(1) %>%
summarise()
}
But I can't figure out how to specify which column to sum inside summarise() (and this loop is likely wrong in other ways anyway).
Ideally I need the output to be another list with each item being the summarised data, like so:
[[1]]
# A tibble: 3 x 2
one sum
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two sum
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three sum
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
Would really appreciate any help!
Using purrr::map, and summarising the columns that contain a literal dot (\\.) with the matches() helper.
library(dplyr)
library(purrr)
map(all, ~ .x %>%
  #group_by_at(vars(matches('one$|two$|three$'))) %>% #alternative: column ends with one, two, or three
  group_by_at(1) %>%
  summarise_at(vars(matches('\\.')), sum))
  #summarise_at(vars(matches('\\.')), list(sum = ~ sum(.)))) #2nd option, names the result column 'sum'
[[1]]
# A tibble: 3 x 2
one one.1
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two two.2
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three three.3
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
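group_by_at() and summarise_at() still work but are superseded in current dplyr; here is a sketch of the same loop using across() and the .data pronoun, assuming the grouping column is always first and the value column second (grp and val are just local helper variables):
library(dplyr)
library(purrr)
map(all, function(d) {
  grp <- names(d)[1] # e.g. "one"
  val <- names(d)[2] # e.g. "one.1"
  d %>%
    group_by(across(all_of(grp))) %>%          # group by the first column, keeping its name
    summarise(sum = sum(.data[[val]]))         # sum the second column
})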
Here's a base R solution:
lapply(all, function(DF) aggregate(list(added = DF[, 2]), by = DF[, 1, drop = F], FUN = sum))
[[1]]
one added
1 blue 1
2 green 0
3 red 3
[[2]]
two added
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
three added
1 blue 0
2 green 2
3 white 1
4 yellow 1
Another approach would be to bind the list into a single table. Here I use data.table and bind by position rather than by name (use.names = FALSE). The only problem is that this may mess up factors, but I'm not sure that's an issue in your case.
library(data.table)
rbindlist(all, use.names = F, idcol = 'id'
)[, .(added = sum(one.1)), by = .(id, color = one)]
id color added
1: 1 red 3
2: 1 blue 1
3: 1 green 0
4: 2 red 1
5: 2 yellow 1
6: 2 green 1
7: 2 blue 1
8: 3 yellow 1
9: 3 green 2
10: 3 white 1
11: 3 blue 0
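If binding by position feels fragile, a variant that standardizes the names first (so rbindlist can match by name) might look like:
library(data.table)
all_named <- lapply(all, setNames, c("color", "value")) # give every frame the same names
rbindlist(all_named, idcol = "id")[, .(added = sum(value)), by = .(id, color)]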
This is a short example of the data. The original data has many more columns and rows.
head(df, 15)
ID col1 col2
1 1 green yellow
2 1 green blue
3 1 green green
4 2 yellow blue
5 2 yellow yellow
6 2 yellow blue
7 3 yellow yellow
8 3 yellow yellow
9 3 yellow blue
10 4 blue yellow
11 4 blue yellow
12 4 blue yellow
13 5 yellow yellow
14 5 yellow blue
15 5 yellow yellow
What I want is to count how many different colors there are in col2, including the color in col1. For example, for ID = 4 there is only 1 color in col2; if we include col1, there are 2 different colors, so the output should be 2, and so on.
I tried it this way, but it doesn't give me my desired output: ID = 4 turns into 0, which is not what I want. So how can I tell R to count the col2 colors together with the color in col1?
out <- df %>%
group_by(ID) %>%
mutate(N = ifelse(col1 != col2, 1, 0))
My desired output is something like this:
ID col1 count
1 green 3
2 yellow 2
3 yellow 2
4 blue 2
5 yellow 2
You can do:
df %>%
group_by(ID, col1) %>%
summarise(count = n_distinct(c(col1, col2)))
ID col1 count
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 2
5 5 yellow 2
Or even, though note this version counts distinct values of col2 alone, so ID 4 comes out as 1 rather than the desired 2:
df %>%
group_by(ID, col1) %>%
summarise_all(n_distinct)
ID col1 col2
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 1
5 5 yellow 2
To group by every three rows (assuming each ID spans exactly three consecutive rows):
df %>%
group_by(group = gl(n()/3, 3), col1) %>%
summarise(count = n_distinct(c(col1, col2)))
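gl(n()/3, 3) requires the row count to be an exact multiple of three; if the last block can be ragged, an integer-division grouping is safer (a sketch, same logic otherwise):
df %>%
  group_by(group = (row_number() - 1) %/% 3 + 1, col1) %>% # blocks of three rows, last may be shorter
  summarise(count = n_distinct(c(col1, col2)))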