How does ggplot2 evaluate factor levels? - r

Consider this example dataframe:
library(tidyverse)
df <- tibble(item = c("Banana", "Ananas", "Apple", "Blueberry", "Orange",
"Spinach", "Cabbage", "Broccoli", "Carrot", "Eggplant"),
category = c(rep("Fruit", 5), rep("Vegetable", 5)),
n = c(57, 19, 14, 11, 8, 318, 70, 33, 31, 23))
First, I order the factor levels of item by n. I also add a variable item_fct that shows the integer code of each level in the output table, to make my problem easier to see:
df %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item))
# A tibble: 10 x 4
item category n item_fct
<fct> <chr> <dbl> <dbl>
1 Banana Fruit 57 8
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 7
9 Carrot Vegetable 31 6
10 Eggplant Vegetable 23 5
This output table and the factor levels make sense to me: Item is labeled from 1-10 by n in an ascending order.
But if I group it by category, I get the following output:
df %>%
group_by(category) %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item))
# A tibble: 10 x 4
# Groups: category [2]
item category n item_fct
<fct> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 5
7 Cabbage Vegetable 70 4
8 Broccoli Vegetable 33 3
9 Carrot Vegetable 31 2
10 Eggplant Vegetable 23 1
Here, the factor levels are rather confusing to me. The items, depending on the group, share the same factor levels (1-5). I would still have expected levels from 1-10, but ordered by n depending on the respective group. Like this:
# A tibble: 10 x 4
item category n item_fct
<chr> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 8
9 Carrot Vegetable 31 7
10 Eggplant Vegetable 23 6
My Question:
Ultimately, it's hard for me to understand why ggplot2 produces the following plot with this code:
df %>%
group_by(category) %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item)) %>%
ggplot(aes(n, item, fill = category)) +
geom_col()
If the items are factored from 1-5 in their respective groups and thus share the same overall factor levels, then why does ggplot2 display the bars by the grouped variable?
This plot is actually my desired outcome, but I don't understand the logic behind it. Considering the factor levels, I would expect bars of the same factor level to be next to each other (basically starting with "Eggplant", then "Orange", then "Carrot", then "Blueberry" etc.).
I would've assumed that the following table would produce the plot, because here the factor levels are ordered in the same way the bars are ordered in the plot:
# A tibble: 10 x 4
item category n item_fct
<chr> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 8
9 Carrot Vegetable 31 7
10 Eggplant Vegetable 23 6
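A possible explanation, sketched under the assumption of dplyr 1.0 or later: inside a grouped mutate(), fct_reorder() and as.numeric() are evaluated once per group, so item_fct only ever sees the five levels of its own group. When dplyr then combines the per-group factors into one column, it appears to take the union of their levels in group order, which gives ten levels overall. The second mutate() after ungroup() below is my addition, not part of the original code; it recomputes the codes on the combined factor so both views can be compared:

```r
library(dplyr)
library(forcats)
library(tibble)

df <- tibble(
  item = c("Banana", "Ananas", "Apple", "Blueberry", "Orange",
           "Spinach", "Cabbage", "Broccoli", "Carrot", "Eggplant"),
  category = c(rep("Fruit", 5), rep("Vegetable", 5)),
  n = c(57, 19, 14, 11, 8, 318, 70, 33, 31, 23)
)

res <- df %>%
  group_by(category) %>%
  mutate(item = fct_reorder(item, n),              # reordered within each group
         item_fct_grouped = as.numeric(item)) %>%  # codes as seen inside the group
  ungroup() %>%
  mutate(item_fct_overall = as.numeric(item))      # codes of the combined factor

levels(res$item)  # the level order ggplot2 uses for the discrete axis
res
```

If this holds, ggplot2 is simply plotting the ten combined levels, which would explain why the bars come out grouped by category and sorted by n within each group.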

Related

Creating multiple frequency count tibbles at once in R

I have data on 30 people that includes ethnicity, gender, school type, whether they received free school meals, etc.
I want to produce frequency counts for all of these features. Currently my code looks like this:
df <- read.csv("~file")
df %>% select(Ethnicity) %>% group_by(Ethnicity) %>% summarise(freq = n())
df %>% select(Gender) %>% group_by(Gender) %>% summarise(freq = n())
df %>% select(School.type) %>% group_by(School.type) %>% summarise(freq = n())
Is there a way I can create a frequency tibble for 8 columns (e.g. ethnicity, gender, school type, etc.) in a more efficient way (e.g. 1 or 2 lines of code)?
As an example output for the ethnicity code:
# A tibble: 13 × 2
Ethnicity freq
<chr> <int>
1 Asian or Asian British - Bangladeshi 1
2 Asian or Asian British - Indian 7
3 Asian or Asian British - Pakistani 1
4 Black or Black British - African 5
5 Black or Black British - Caribbean 2
6 Chinese 3
7 Mixed - White and Asian 2
8 Mixed - White and Black African 1
9 Mixed - White and Black Caribbean 1
10 Not known/ prefer not to say 1
11 White British 27
12 White Irish 1
13 White Other 5
And for gender:
# A tibble: 2 × 2
Gender freq
<chr> <int>
1 Female 36
2 Male 21
NB: some columns also contain postcodes and names, which I obviously don't want frequency counts for, so I think I'll need to select just the columns I want to apply this to.
One option would be to use lapply to loop over a vector of your desired columns and dplyr::count for the frequency table.
Using the starwars dataset as example data:
library(dplyr, warn = FALSE)
cols <- c("hair_color", "sex")
lapply(cols, function(x) {
count(starwars, .data[[x]], name = "freq")
})
#> [[1]]
#> # A tibble: 13 × 2
#> hair_color freq
#> <chr> <int>
#> 1 auburn 1
#> 2 auburn, grey 1
#> 3 auburn, white 1
#> 4 black 13
#> 5 blond 3
#> 6 blonde 1
#> 7 brown 18
#> 8 brown, grey 1
#> 9 grey 1
#> 10 none 37
#> 11 unknown 1
#> 12 white 4
#> 13 <NA> 5
#>
#> [[2]]
#> # A tibble: 5 × 2
#> sex freq
#> <chr> <int>
#> 1 female 16
#> 2 hermaphroditic 1
#> 3 male 60
#> 4 none 6
#> 5 <NA> 4
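If a single tibble is more convenient than a list, one alternative sketch is to stack the chosen columns with tidyr::pivot_longer() and count once. The column names variable, value and freq are my own choices, and this assumes the selected columns share a type (both are character here); mixed types would need converting first.

```r
library(dplyr)
library(tidyr)

freqs <- starwars %>%
  select(hair_color, sex) %>%                 # only the columns to tabulate
  pivot_longer(everything(),
               names_to = "variable",         # which column the value came from
               values_to = "value") %>%
  count(variable, value, name = "freq")       # one row per variable/value pair

freqs
```

filter(freqs, variable == "sex") then recovers the table for a single column.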

How to use pivot_longer() in R to separate columns into multiple rows by category?

Here is some fictional data:
tibble(fruit = rep(c("apple", "pear", "orange"), each = 3),
size = rep(c("big", "medium", "small"), times = 3),
# summer stock
shopA_summer_wk1 = abs(round(rnorm(9, 10, 5), 0)),
shopA_summer_wk2 = abs(round(rnorm(9, 10, 5), 0)),
shopB_summer_wk1 = abs(round(rnorm(9, 10, 5), 0)),
shopB_summer_wk2 = abs(round(rnorm(9, 10, 5), 0)),
shopC_summer_wk1 = abs(round(rnorm(9, 10, 5), 0)),
shopC_summer_wk2 = abs(round(rnorm(9, 10, 5), 0)),
# winter stock
shopA_winter_wk1 = abs(round(rnorm(9, 8, 4), 0)),
shopA_winter_wk2 = abs(round(rnorm(9, 8, 4), 0)),
shopA_winter_wk3 = abs(round(rnorm(9, 8, 4), 0)),
shopB_winter_wk1 = abs(round(rnorm(9, 8, 4), 0)),
shopB_winter_wk2 = abs(round(rnorm(9, 8, 4), 0)),
shopB_winter_wk3 = abs(round(rnorm(9, 8, 4), 0)),
shopC_winter_wk1 = abs(round(rnorm(9, 8, 4), 0)),
shopC_winter_wk2 = abs(round(rnorm(9, 8, 4), 0)),
shopC_winter_wk3 = abs(round(rnorm(9, 8, 4), 0)))
Some data is collected for 3 shops (A, B, C) across 2 weeks in the summer and 3 weeks in the winter. The data collected is the number of fruits (apple, pear, orange) per size (big, medium, small) the shop had in stock on that particular week.
Here are the first 6 rows of the dataset:
# fruit size shopA_summer_wk1 shopA_summer_wk2 shopB_summer_wk1 shopB_summer_wk2 shopC_summer_wk1 shopC_summer_wk2 shopA_winter_wk1 shopA_winter_wk2 shopA_winter_wk3
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 apple big 9 12 12 16 15 5 14 4 0
# 2 apple medium 21 16 16 1 12 11 8 8 9
# 3 apple small 10 6 18 18 22 12 4 2 0
# 4 pear big 13 7 4 12 13 6 10 6 2
# 5 pear medium 13 12 8 0 8 5 11 7 3
# 6 pear small 16 18 4 3 13 8 7 5 0
I would like to use the pivot_longer() function in R to restructure this dataset. Given that there are quite a few group categories I'm having difficulty in writing the code for this.
I would like it to look something like the following:
I would greatly appreciate any input :)
Using the names_pattern argument, we can do:
pivot_longer(df, c(-fruit, -size), names_pattern = '(^.*)_wk(.*$)',
names_to = c('Shop_season', 'week'))
#> # A tibble: 135 x 5
#> fruit size Shop_season week value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 apple big shopA_summer 1 11
#> 2 apple big shopA_summer 2 8
#> 3 apple big shopB_summer 1 4
#> 4 apple big shopB_summer 2 24
#> 5 apple big shopC_summer 1 9
#> 6 apple big shopC_summer 2 10
#> 7 apple big shopA_winter 1 9
#> 8 apple big shopA_winter 2 12
#> 9 apple big shopA_winter 3 5
#> 10 apple big shopB_winter 1 5
#> # ... with 125 more rows
You might also want to separate shop and season, since these are really two different variables:
pivot_longer(df, c(-fruit, -size), names_pattern = '(^.*)_wk(.*$)',
names_to = c('Shop_season', 'week')) %>%
separate(Shop_season, into = c('shop', 'season'))
#> # A tibble: 135 x 6
#> fruit size shop season week value
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 apple big shopA summer 1 11
#> 2 apple big shopA summer 2 8
#> 3 apple big shopB summer 1 4
#> 4 apple big shopB summer 2 24
#> 5 apple big shopC summer 1 9
#> 6 apple big shopC summer 2 10
#> 7 apple big shopA winter 1 9
#> 8 apple big shopA winter 2 12
#> 9 apple big shopA winter 3 5
#> 10 apple big shopB winter 1 5
#> #... with 125 more rows
If the data is called dt, then:
pivot_longer(
data = dt,
cols = -c(fruit:size),
names_to = c("shop_season", "week"),
names_pattern = "(.*)_(.*)"
)
Output:
# A tibble: 135 x 5
fruit size shop_season week value
<chr> <chr> <chr> <chr> <dbl>
1 apple big shopA_summer wk1 13
2 apple big shopA_summer wk2 12
3 apple big shopB_summer wk1 9
4 apple big shopB_summer wk2 9
5 apple big shopC_summer wk1 7
6 apple big shopC_summer wk2 17
7 apple big shopA_winter wk1 10
8 apple big shopA_winter wk2 17
9 apple big shopA_winter wk3 12
10 apple big shopB_winter wk1 8
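The two answers above can also be combined so that shop, season and week come out as three columns in a single call. This is a minimal sketch on a two-row stand-in for the data; the values_to name stock is my own choice, and names_transform assumes tidyr 1.1.0 or later.

```r
library(tidyr)
library(tibble)

# Small stand-in for the question's data, same column naming scheme
df <- tibble(
  fruit = c("apple", "pear"),
  size = c("big", "small"),
  shopA_summer_wk1 = c(9, 13),
  shopA_winter_wk3 = c(0, 2)
)

long <- pivot_longer(
  df,
  cols = -c(fruit, size),
  names_to = c("shop", "season", "week"),
  names_pattern = "(.*)_(.*)_wk(.*)",          # three capture groups, one per new column
  names_transform = list(week = as.integer),   # store the week as a number
  values_to = "stock"
)

long
```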

Collapsing group of strings into one string using an if statement within a for loop in R

I have a dataframe with a column "Food."
dataframe <- data.frame(Color = c("red","red","red","red","red","blue","blue","blue","blue","blue","green","green","green","green","green","orange","orange","orange","orange","orange"),
Food = c("banana","apple","potato","orange","egg","strawberry","cheese","yogurt","kiwi","butter","kale","sugar","carrot","celery","radish","cereal","milk","blueberry","squash","lemon"), Count = c(2,5,4,8,10,7,5,6,9,11,1,8,5,3,7,9,2,3,6,4))
Every time a fruit appears I want to replace the name of the fruit with "fruit."
I've tried making a vector of the fruit names. Then I go through each row in the dataframe and where the string matches the fruit, I want to replace the fruit name with "fruit."
fruit_list <- c("banana","apple","orange","strawberry","kiwi","blueberry","lemon")
for (r in 1:nrow(dataframe)) {
for (i in 1:length(fruit_list)){
if (length(grep(fruit_list[i], dataframe$Food[r])) != 0) {
dataframe$Food[r] <- paste("fruit")
}
}
}
How do I use this general format so that dataframe$Food doesn't just end up filled with NA?
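A likely cause of the NAs, assuming Food was created as a factor (the default for data.frame() before R 4.0): assigning a string that is not one of the factor's existing levels produces NA with a warning. The sketch below reproduces this on a small stand-in vector and shows that converting to character first avoids it.

```r
fruit_list <- c("banana", "kiwi")
food <- factor(c("banana", "potato", "kiwi"))  # stand-in for dataframe$Food

# "fruit" is not an existing level, so the assignment yields NA
broken <- food
suppressWarnings(broken[1] <- "fruit")
broken

# Converting to character first lets the replacement go through
food_chr <- as.character(food)
food_chr[food_chr %in% fruit_list] <- "fruit"
food_chr
```

Note that %in% compares whole strings, whereas grep() in the original loop also matches substrings.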
With dplyr:
library(dplyr)
dataframe %>%
mutate(Food=as.character(Food),
Food=ifelse(Food%in%fruit_list,"Fruit",Food))#can change to fruit
Result:
Color Food Count
1 red Fruit 2
2 red Fruit 5
3 red potato 4
4 red Fruit 8
5 red egg 10
6 blue Fruit 7
7 blue cheese 5
8 blue yogurt 6
9 blue Fruit 9
10 blue butter 11
11 green kale 1
12 green sugar 8
13 green carrot 5
14 green celery 3
15 green radish 7
16 orange cereal 9
17 orange milk 2
18 orange Fruit 3
19 orange squash 6
20 orange Fruit 4
Base R only:
dataframe$Food <-
sapply(dataframe$Food,
function(x,fruit_list) ifelse(x %in% fruit_list, "fruit", as.character(x) ),
fruit_list = fruit_list )
You don't necessarily need dplyr for this.
Just use:
dataframe$Food <- ifelse(dataframe$Food %in% fruit_list, "Fruit", as.character(dataframe$Food))
You can do this in one line with the data.table package:
library(data.table)
# := modifies the table by reference; the trailing [] prints the result
setDT(dataframe)[, Food := ifelse(Food %in% fruit_list, "fruit", as.character(Food))][]
Color Food Count
1: red fruit 2
2: red fruit 5
3: red potato 4
4: red fruit 8
5: red egg 10
6: blue fruit 7
7: blue cheese 5
8: blue yogurt 6
9: blue fruit 9
10: blue butter 11
11: green kale 1
12: green sugar 8
13: green carrot 5
14: green celery 3
15: green radish 7
16: orange cereal 9
17: orange milk 2
18: orange fruit 3
19: orange squash 6
20: orange fruit 4

Summarise? Count occurrences in column based on another column

I believe this may have a simple solution but I'm having trouble describing what I need to do (and hence what to search for). I think I need the summarize function. My goal output is at the very bottom.
I'm trying to count the occurrences of a value between each unique value in another column. Here is an example df that hopefully illustrates what I need to do.
library(dplyr)
set.seed(1)
df <- tibble("name" = c(rep("dinah",2),rep("lucy",4),rep("sora",9)),
"meal" = c(rep(c("chicken","beef","fish"),5)),
"date" = seq(as.Date("1999/1/1"),as.Date("2000/1/1"),25),
"num.wins" = sample(0:30)[1:15])
Among other things, I'm trying to summarize (sum) the types of meals each name had using this data.
df
# A tibble: 15 x 4
name meal date num.wins
<chr> <chr> <date> <int>
1 dinah chicken 1999-01-01 8
2 dinah beef 1999-01-26 11
3 lucy fish 1999-02-20 16
4 lucy chicken 1999-03-17 25
5 lucy beef 1999-04-11 5
6 lucy fish 1999-05-06 23
7 sora chicken 1999-05-31 27
8 sora beef 1999-06-25 15
9 sora fish 1999-07-20 14
10 sora chicken 1999-08-14 1
11 sora beef 1999-09-08 4
12 sora fish 1999-10-03 3
13 sora chicken 1999-10-28 13
14 sora beef 1999-11-22 6
15 sora fish 1999-12-17 18
I've made progress with other calculations I'm interested in, below:
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins))
# A tibble: 3 x 5
name count medianDate life wins
<chr> <int> <date> <time> <int>
1 dinah 2 1999-01-13 25 days 19
2 lucy 4 1999-03-29 75 days 69
3 sora 9 1999-09-08 200 days 101
My goal is to add an additional column for each type of food, and have the sum of the occurrences of that food displayed in each row, like so:
name count medianDate life wins chicken beef fish
1 dinah 2 1999-01-13 25 days 19 1 1 0
2 lucy 4 1999-03-29 75 days 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
Though older, and possibly on a deprecation path, reshape2::dcast does this nicely:
reshape2::dcast(df, name ~ meal)
# name beef chicken fish
# 1 dinah 1 1 0
# 2 lucy 1 1 2
# 3 sora 3 3 3
You can understand the formula as rows ~ columns. By default, it will aggregate the values in the columns using the length function---which gives exactly what you want, the count of each.
This can be easily joined to your summary data:
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins)) %>%
left_join(reshape2::dcast(df, name ~ meal))
# # A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <int> <int> <int>
# 1 dinah 2 1999-01-13 25 days 19 1 1 0
# 2 lucy 4 1999-03-29 75 days 69 1 1 2
# 3 sora 9 1999-09-08 200 days 101 3 3 3
One option is to use table inside summarise as a list column, unnest and then spread it to 'wide' format
library(tidyverse)
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins),
n = list(enframe(table(meal))) ) %>%
unnest %>%
spread(name1, value, fill = 0)
# A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <dbl> <dbl> <dbl>
#1 dinah 2 1999-01-13 25 days 19 1 1 0
#2 lucy 4 1999-03-29 75 days 69 1 1 2
#3 sora 9 1999-09-08 200 days 101 3 3 3
I'm not entirely sure why I'm getting the funky formatting for life, but I think this gets at your need for a count of the meal types.
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins),
chicken = sum(meal == "chicken"),
beef = sum(meal == "beef"),
fish = sum(meal == "fish"))
# A tibble: 3 x 8
name count medianDate life wins chicken beef fish
<chr> <int> <date> <time> <int> <int> <int> <int>
1 dinah 2 1999-01-13 " 25 days" 19 1 1 0
2 lucy 4 1999-03-29 " 75 days" 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
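Since reshape2 and spread() are both older interfaces, the same per-meal counts can also be sketched with count() and tidyr::pivot_wider(). This assumes a reasonably recent tidyr (a scalar values_fill is not accepted by the earliest pivot_wider() releases) and uses only the name and meal columns from the question.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  name = c(rep("dinah", 2), rep("lucy", 4), rep("sora", 9)),
  meal = rep(c("chicken", "beef", "fish"), 5)
)

meal_counts <- df %>%
  count(name, meal) %>%                  # one row per name/meal pair
  pivot_wider(names_from = meal,
              values_from = n,
              values_fill = 0)           # meals a name never had become 0

meal_counts
```

The result has one row per name and can be attached to the summary table with left_join(meal_counts, by = "name").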

How to identify mismatch between two data sets in R?

I have two data sets, Dataset 1 and Dataset 2, which are as follows:
Dataset1:-
family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2
Dataset2:-
family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female
There is a mismatch between the number of members entered in dataset1 and the individual records entered in dataset2. For example, for family_id 2 the number of members is 3 in dataset1, but dataset2 has entries for only 2 members.
How can I identify these types of mismatch between the two data sets?
Both of these views might be helpful for you:
dataset2 %>%
add_count(family_id) %>%
inner_join(dataset1) %>%
mutate(match= n ==number_family_member)
# # A tibble: 7 x 7
# family_id house_id age gender n number_family_member match
# <int> <int> <int> <fctr> <int> <int> <lgl>
# 1 1 1052 24 male 2 2 TRUE
# 2 1 1052 25 female 2 2 TRUE
# 3 2 5042 23 male 2 3 FALSE
# 4 2 5042 20 female 2 3 FALSE
# 5 3 1111 1 male 3 2 FALSE
# 6 3 1111 20 female 3 2 FALSE
# 7 3 1111 21 female 3 2 FALSE
dataset2 %>%
count(family_id) %>%
inner_join(dataset1) %>%
mutate(match= n ==number_family_member)
# # A tibble: 3 x 5
# family_id n house_id number_family_member match
# <int> <int> <int> <int> <lgl>
# 1 1 2 1052 2 TRUE
# 2 2 2 5042 3 FALSE
# 3 3 3 1111 2 FALSE
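A variation on the second view, as a sketch: inner_join() silently drops families that appear in only one of the two data sets, so a full_join() followed by a filter returns just the problem rows, including families missing from either side. The column name n_listed is my own choice, and the data frames are rebuilt here so the block is self-contained.

```r
library(dplyr)

dataset1 <- data.frame(
  family_id = c(1, 2, 3),
  house_id = c(1052, 5042, 1111),
  number_family_member = c(2, 3, 2)
)
dataset2 <- data.frame(
  family_id = c(1, 1, 2, 2, 3, 3, 3),
  house_id = c(1052, 1052, 5042, 5042, 1111, 1111, 1111),
  age = c(24, 25, 23, 20, 1, 20, 21),
  gender = c("male", "female", "male", "female", "male", "female", "female")
)

mismatches <- dataset2 %>%
  count(family_id, name = "n_listed") %>%        # members actually listed
  full_join(dataset1, by = "family_id") %>%      # keep families from both sides
  filter(is.na(n_listed) |                       # declared but never listed
           is.na(number_family_member) |         # listed but never declared
           n_listed != number_family_member)     # counts disagree

mismatches
```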
We can use count to count the number of family members and create a new data frame df3, and then use setequal to compare df1 and df3.
library(dplyr)
df3 <- df2 %>%
count(family_id, house_id) %>%
rename(number_family_member = n)
setequal(df1, df3)
# FALSE: Rows in x but not y: 2, 3. Rows in y but not x: 2, 3.
DATA
df1 <- read.table(text = "family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2",
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female",
header = TRUE, stringsAsFactors = FALSE)
This can be done with aggregate and merge.
agg <- aggregate(family_id ~ factor(family_id), dataset2, length)
mrg <- merge(agg, dataset1[c(1, 3)], by.x = "factor(family_id)", by.y = "family_id")
result <- data.frame(family_id = dataset1$family_id)
result$Match <- ifelse(dataset1$number_family_member == mrg$family_id, "match", "mismatch")
result
# family_id Match
#1 1 match
#2 2 mismatch
#3 3 mismatch
rm(agg, mrg) # final clean up
DATA.
dataset1 <- read.table(text = "
family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2
", header = TRUE)
dataset2 <- read.table(text = "
family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female
", header = TRUE)
