How to build a summary data frame - r

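I have a data set with the columns ID, Group, Color and Realy_Love (sample below), and I would like to get a summary data set that counts the observations for each combination of Group, Color and Realy_Love.
What should I do? Thanks. The sample data can be built with the following code: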
ID<- c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18")
Group<-c("A","B","C","D","D","D","A","B","D","C","B","D","A","A","C","B","B","B")
Color<-c("Green","Yellow","Red","Red","Red","Yellow","Green","Green","Yellow","Red","Red","Yellow","Yellow","Yellow","Green","Red","Red","Green")
Realy_Love<-c("Y","N","Y","Y","N","N","Y","Y","Y","N","N","Y","N","Y","N","Y","N","Y")
Sample.data <- data.frame(ID, Group, Color, Realy_Love)

You can use dplyr: group by the three columns and count the rows in each group:
library(dplyr)
Sample.data %>%
group_by(Group, Color, Realy_Love) %>%
summarise(Obs = n())
# Group Color Realy_Love Obs
# <chr> <chr> <chr> <int>
# 1 A Green Y 2
# 2 A Yellow N 1
# 3 A Yellow Y 1
# 4 B Green Y 2
# 5 B Red N 2
# 6 B Red Y 1
# 7 B Yellow N 1
# 8 C Green N 1
# 9 C Red N 1
# 10 C Red Y 1
# 11 D Red N 1
# 12 D Red Y 1
# 13 D Yellow N 1
# 14 D Yellow Y 2

Use dplyr from the tidyverse to get a summary. You can then use arrange() to sort by Color or another variable.
Sample.data %>%
group_by(Group, Color, Realy_Love) %>%
summarise(Obs = n()) %>%
arrange(Color)

With dplyr, you don't even need to group the columns; the count() function does it in one step:
Sample.data %>%
count(Group, Color, Realy_Love, sort = TRUE)
The optional sort = TRUE argument sorts the result in descending order of count, most frequent first:
Group Color Realy_Love n
1 A Green Y 2
2 B Green Y 2
3 B Red N 2
4 D Yellow Y 2
5 A Yellow N 1
6 A Yellow Y 1
7 B Red Y 1
8 B Yellow N 1
9 C Green N 1
10 C Red N 1
11 C Red Y 1
12 D Red N 1
13 D Red Y 1
14 D Yellow N 1
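For comparison, the same counts can be produced without dplyr. This is only a minimal base R sketch, assuming Sample.data is built as in the question; the helper column Obs and the name summary_df are introduced here for illustration, and the row order will differ from the dplyr output.

```r
# Rebuild the sample data from the question
ID <- as.character(1:18)
Group <- c("A","B","C","D","D","D","A","B","D","C","B","D","A","A","C","B","B","B")
Color <- c("Green","Yellow","Red","Red","Red","Yellow","Green","Green","Yellow",
           "Red","Red","Yellow","Yellow","Yellow","Green","Red","Red","Green")
Realy_Love <- c("Y","N","Y","Y","N","N","Y","Y","Y","N","N","Y","N","Y","N","Y","N","Y")
Sample.data <- data.frame(ID, Group, Color, Realy_Love, stringsAsFactors = FALSE)

# Give every row a weight of 1, then sum the weights per combination.
# aggregate() only returns combinations that actually occur in the data.
summary_df <- aggregate(Obs ~ Group + Color + Realy_Love,
                        data = transform(Sample.data, Obs = 1L),
                        FUN = sum)
print(summary_df)
```

Like count(), this returns one row per observed combination rather than every possible one.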

Related

R Counting Occurences in a Column of Data Frame, Grouped by Another Column

I basically have a data frame with a column of letters and a column of colors:
x <- data.frame(col1=c("a","b","a","c","d","d","c","a","b","c"),
col2=c("red","orange","yellow","red","red","yellow","orange","yellow","red","orange"))
col1 col2
a red
b orange
a yellow
c red
d red
d yellow
c orange
a yellow
b red
c orange
My goal is to create a second data frame that counts the number of occurrences of each color in col2 of x for each letter in col1. Basically:
Letters Occurences Red Orange Yellow
a 3 1 0 2
b 2 1 1 0
c 3 1 2 0
d 2 1 0 1
Right now, I just brute-forced it since col2 has only 3 distinct values. I used:
df <- data.frame(Letters = levels(factor(x$col1)))
df$Occurences <- table(x$col1)
df$red <- table(factor(x$col1[x$col2=="red"],levels=levels(factor(x$col1))))
df$orange <- table(factor(x$col1[x$col2=="orange"],levels=levels(factor(x$col1))))
df$yellow <- table(factor(x$col1[x$col2=="yellow"],levels=levels(factor(x$col1))))
Is there an easier way to do this, as opposed to doing each column of df one by one? Especially with a data set that has a lot more than 3 factors?
Use pivot_wider from tidyr
library(tidyr)
x %>%
pivot_wider(names_from = col2, values_from = col2, values_fn = "length", values_fill = 0)
Output:
# A tibble: 4 × 4
col1 red orange yellow
<chr> <int> <int> <int>
1 a 1 0 2
2 b 1 1 0
3 c 1 2 0
4 d 1 0 1
Alternatively, in base R, cross-tabulate the two columns and add a row-sum margin:
as.data.frame.matrix(addmargins(table(x), 2))
orange red yellow Sum
a 0 1 2 3
b 1 1 0 2
c 2 1 0 3
d 0 1 1 2
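If the exact layout from the question is needed (letters, total occurrences, then one column per color), the table can be rearranged by hand. A base R sketch, assuming the column names Letters and Occurences from the question; the name res is introduced here, and the color columns come out in alphabetical order:

```r
x <- data.frame(col1 = c("a","b","a","c","d","d","c","a","b","c"),
                col2 = c("red","orange","yellow","red","red","yellow",
                         "orange","yellow","red","orange"),
                stringsAsFactors = FALSE)

# Cross-tabulate letters (rows) against colors (columns)
tab <- table(x$col1, x$col2)

# Assemble the requested layout: letter, row total, one column per color
res <- data.frame(Letters = rownames(tab),
                  Occurences = as.vector(rowSums(tab)),
                  as.data.frame.matrix(tab),
                  row.names = NULL)
print(res)
```

Because the color columns come from the table itself, nothing needs to change when col2 gains more distinct values.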

R data imputation from group_by table based on count

group = c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep = c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test = data.frame(group, animal, sleep)
print(test)
library(dplyr)
group_animal = test %>% group_by(group, animal) %>% count(sleep)
print(group_animal)
I would like to replace the NA values in the sleep column of test with the most frequent sleep answer for the same group and animal.
For example, the NA for group 1, animal a should become 'y', because that is the value with the highest count among group 1, animal a.
Likewise, the NA for group 4, animal c should become 'n'.
Another option is replacing the NAs with the mode. You can pass the Mode function from this post to the na.aggregate function from zoo to replace these NAs like this:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
group = c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep = c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test = data.frame(group, animal, sleep)
library(dplyr)
library(zoo)
test %>%
group_by(group, animal) %>%
mutate(sleep = na.aggregate(sleep , FUN=Mode)) %>%
ungroup()
#> # A tibble: 24 × 3
#> group animal sleep
#> <dbl> <chr> <chr>
#> 1 1 a y
#> 2 1 b n
#> 3 4 c y
#> 4 4 c y
#> 5 4 d y
#> 6 5 a n
#> 7 5 b n
#> 8 6 c y
#> 9 1 b n
#> 10 4 d y
#> # … with 14 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Created on 2022-07-26 by the reprex package (v2.0.1)
Here is the tail of the output:
> tail(test)
# A tibble: 6 × 3
group animal sleep
<dbl> <chr> <chr>
1 4 c n
2 4 c n
3 1 a y
4 4 c n
5 5 a n
6 6 c y
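The same idea can be sketched without zoo, using ave() from base R to apply a function within each group/animal combination. The helper fill_mode and the column sleep_filled are hypothetical names introduced here; ties are resolved in favour of the value that appears first, as in Mode above.

```r
# Mode helper as in the answer above: most frequent value, first one wins on ties
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: fill the NAs of one group with that group's mode.
# Groups that are empty or entirely NA are returned unchanged.
fill_mode <- function(v) {
  if (anyNA(v) && !all(is.na(v))) {
    v[is.na(v)] <- Mode(v[!is.na(v)])
  }
  v
}

group <- c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal <- c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c',
            'c','c','c','c','a','c','a','c')
sleep <- c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y',
           'n','n','n','n', NA, NA, NA, NA)
test <- data.frame(group, animal, sleep, stringsAsFactors = FALSE)

# ave() splits sleep by group and animal, applies fill_mode, and reassembles
test$sleep_filled <- ave(test$sleep, test$group, test$animal, FUN = fill_mode)
print(tail(test, 4))
```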
Updated to use group_by(group, animal) (thanks @Quinten); the prior answer has been removed:
group by group and animal
use replace_na with the replace argument set to sleep[n == max(n)]
new: in case of ties, as in group 5, add !is.na(sleep) to avoid conflicts:
library(dplyr)
library(tidyr)
group_animal %>%
group_by(group, animal) %>%
arrange(desc(sleep), .by_group = TRUE) %>%
mutate(sleep = replace_na(sleep, sleep[n==max(n) & !is.na(sleep)]))
group animal sleep n
<dbl> <chr> <chr> <int>
1 1 a y 3
2 1 a n 1
3 1 a m 1
4 1 a y 1
5 1 b n 2
6 4 c y 2
7 4 c n 4
8 4 c n 1
9 4 d y 2
10 5 a n 1
11 5 a n 1
12 5 b n 1
13 6 c y 2
14 6 c n 1
15 6 c y 1
Try this.
This method essentially creates a replacement value to coalesce with sleep: it picks the sleep value with the highest count, where the counts come from str_count:
library(dplyr)
test |>
group_by(group, animal) |>
# which.max() gives the position of the most frequent value; NA counts are skipped
mutate(sleep = coalesce(sleep, sleep[which.max(stringr::str_count(paste(sleep, collapse = ""), pattern = sleep))])) |>
ungroup()
group animal sleep
1 1 a y
2 1 b n
3 4 c y
4 4 c y
5 4 d y
6 5 a n
7 5 b n
8 6 c y
9 1 b n
10 4 d y
11 6 c n
12 1 a y
13 1 a y
14 1 a n
15 1 a m
16 6 c y
17 4 c n
18 4 c n
19 4 c n
20 4 c n
21 1 a y
22 4 c n
23 5 a n
24 6 c y

How to group and summarise each data frame in a list of data frames

I have a list of data frames:
df1 <- data.frame(one = c('red','blue','green','red','red','blue','green','green'),
one.1 = as.numeric(c('1','1','0','1','1','0','0','0')))
df2 <- data.frame(two = c('red','yellow','green','yellow','green','blue','blue','red'),
two.2 = as.numeric(c('0','1','1','0','0','0','1','1')))
df3 <- data.frame(three = c('yellow','yellow','green','green','green','white','blue','white'),
three.3 = as.numeric(c('1','0','0','1','1','0','0','1')))
all <- list(df1,df2,df3)
I need to group each data frame by the first column and summarise the second column.
Individually I would do something like this:
library(dplyr)
df1 <- df1 %>%
group_by(one) %>%
summarise(sum = sum(one.1))
However I'm having trouble figuring out how to iterate over each item in the list.
I've thought of using a loop:
for(i in 1:3){
all[i] <- all[i] %>%
group_by_at(1) %>%
summarise()
}
But I can't figure out how to specify a column to sum in the summarise() function (this loop is likely wrong in other ways than that anyway).
Ideally I need the output to be another list with each item being the summarised data, like so:
[[1]]
# A tibble: 3 x 2
one sum
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two sum
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three sum
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
Would really appreciate any help!
Use purrr::map, group by the first column, and summarise the columns whose names contain a literal dot (\\.) with the matches() helper.
library(dplyr)
library(purrr)
map(all, ~.x %>%
#group_by_at(vars(matches('one$|two$|three$'))) %>% #column ends with one, two, or three
group_by_at(1) %>%
summarise_at(vars(matches('\\.')),sum))
#summarise_at(vars(matches('\\.')),list(sum=~sum))) #2nd option
[[1]]
# A tibble: 3 x 2
one one.1
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two two.2
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three three.3
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
Here's a base R solution:
lapply(all, function(DF) aggregate(list(added = DF[, 2]), by = DF[, 1, drop = FALSE], FUN = sum))
[[1]]
one added
1 blue 1
2 green 0
3 red 3
[[2]]
two added
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
three added
1 blue 0
2 green 2
3 white 1
4 yellow 1
Another approach would be to bind the data frames in the list into one. Here I use data.table and match columns by position rather than by name. The only problem is that this may mess up factors, but I'm not sure that's an issue in your case.
library(data.table)
rbindlist(all, use.names = F, idcol = 'id'
)[, .(added = sum(one.1)), by = .(id, color = one)]
id color added
1: 1 red 3
2: 1 blue 1
3: 1 green 0
4: 2 red 1
5: 2 yellow 1
6: 2 green 1
7: 2 blue 1
8: 3 yellow 1
9: 3 green 2
10: 3 white 1
11: 3 blue 0
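The loop from the question can also be made to work. One likely reason it fails is that all[i] returns a one-element list rather than the data frame itself, whereas all[[i]] extracts the data frame. A base R sketch, with the list renamed to dfs and the result stored in out (both hypothetical names, chosen to avoid masking base::all), and tapply() standing in for group_by()/summarise():

```r
df1 <- data.frame(one = c('red','blue','green','red','red','blue','green','green'),
                  one.1 = c(1, 1, 0, 1, 1, 0, 0, 0))
df2 <- data.frame(two = c('red','yellow','green','yellow','green','blue','blue','red'),
                  two.2 = c(0, 1, 1, 0, 0, 0, 1, 1))
df3 <- data.frame(three = c('yellow','yellow','green','green','green','white','blue','white'),
                  three.3 = c(1, 0, 0, 1, 1, 0, 0, 1))
dfs <- list(df1, df2, df3)

out <- vector("list", length(dfs))
for (i in seq_along(dfs)) {
  d <- dfs[[i]]                          # [[ ]] extracts the data frame itself
  sums <- tapply(d[[2]], d[[1]], sum)    # sum column 2 within each value of column 1
  out[[i]] <- data.frame(key = names(sums), sum = as.vector(sums))
  names(out[[i]])[1] <- names(d)[1]      # keep the original name of the grouping column
}
print(out)
```

Referring to the columns by position (d[[1]], d[[2]]) is what lets the same loop body work even though each data frame uses different column names.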

Select the n most frequent values in a variable

I would like to find the most common values in a column in a data frame. I assume using table would be the best way to do this? I then want to filter/subset my data frame to only include these top-n values.
An example of my data frame is as follows. Here I want to find e.g. the top 2 IDs.
ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange
I therefore want to output the following:
Top 2 values of ID are:
A and C
I will then select the rows corresponding to ID A and C:
ID col
A blue
A purple
A green
C red
C blue
C yellow
C orange
You can try a tidyverse approach. Add the counts of each ID, then filter for the top two (using < 3) or top ten (using < 11):
library(tidyverse)
d %>%
add_count(ID) %>%
filter(dense_rank(-n) < 3)
# A tibble: 7 x 3
ID col n
<fct> <fct> <int>
1 A blue 3
2 A purple 3
3 A green 3
4 C red 4
5 C blue 4
6 C yellow 4
7 C orange 4
Data
d <- read.table(text="ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange", header=T)
We can count the number of values using table, sort them in decreasing order and select the first 2 (or 10) values, get the corresponding IDs and subset those IDs from the data frame (called df here).
df[df$ID %in% names(sort(table(df$ID), decreasing = TRUE)[1:2]), ]
# ID col
#1 A blue
#2 A purple
#3 A green
#6 C red
#7 C blue
#8 C yellow
#9 C orange
With the tidyverse and its top_n():
library(tidyverse)
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2)
Selecting by n()
# A tibble: 2 x 2
ID `n()`
<fct> <int>
1 A 3
2 C 4
To complete with the subset :
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2) %>%
{ filter(d, ID %in% .$ID) }
Selecting by n()
ID col
1 A blue
2 A purple
3 A green
4 C red
5 C blue
6 C yellow
7 C orange
(We use the braces because the left-hand side result should not be passed as the first argument of filter; it is referenced as . inside the call instead.)
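The approaches above differ when counts are tied: sort(table(...))[1:2] always keeps exactly two IDs, while dense_rank(-n) < 3 keeps every ID whose count is among the two highest distinct counts. A base R sketch of the tie-keeping behaviour, using a hypothetical helper top_ids() and the sample data d from the first answer:

```r
# Hypothetical helper: IDs whose frequency is among the n highest distinct counts
top_ids <- function(ids, n) {
  counts <- table(ids)
  distinct_counts <- sort(unique(as.vector(counts)), decreasing = TRUE)
  cutoff <- distinct_counts[min(n, length(distinct_counts))]
  names(counts)[as.vector(counts) >= cutoff]
}

d <- read.table(text = "ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange", header = TRUE)

keep <- top_ids(d$ID, 2)          # IDs to retain
top_rows <- d[d$ID %in% keep, ]   # rows belonging to those IDs
print(keep)
print(top_rows)
```

With this sample there are no ties, so both behaviours select A and C.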

dplyr split by semi colon in case_when

Suppose I have a dataframe df
library(dplyr)
df <- data.frame(ID = c(1:10), Type = c('a', 'a;b','b','a','b','b','c','a;c','b;c','c'))
And I want to add a column called color, based on the values that appear in Type. (This is just an example, in my code there are many more variations of Type, i.e. d;f, e;q,a;z etc)
df %>%
mutate(color = case_when(
Type == 'a' ~ 'red',
Type == 'b' ~ 'blue',
Type == 'c' ~ 'green',
TRUE ~ as.character(Type)
))
As this stands, it returns
ID Type color
1 1 a red
2 2 a;b a;b
3 3 b blue
4 4 a red
5 5 b blue
6 6 b blue
7 7 c green
8 8 a;c a;c
9 9 b;c b;c
10 10 c green
I am curious if there is a way to split by semi-colon within the case_when(), in order to produce the output
ID Type color
1 1 a red
2 2 a;b red;blue
3 3 b blue
4 4 a red
5 5 b blue
6 6 b blue
7 7 c green
8 8 a;c red;green
9 9 b;c blue;green
10 10 c green
You can split the Type column into separate rows, map it to colors and then paste them together:
library(dplyr); library(tidyr);
df %>%
separate_rows(Type) %>%
mutate(color = case_when(
Type == 'a' ~ 'red',
Type == 'b' ~ 'blue',
Type == 'c' ~ 'green',
TRUE ~ as.character(Type)
)) %>%
group_by(ID) %>%
summarise_all(~ paste0(., collapse = ";"))
# A tibble: 10 x 3
# ID Type color
# <int> <chr> <chr>
# 1 1 a red
# 2 2 a;b red;blue
# 3 3 b blue
# 4 4 a red
# 5 5 b blue
# 6 6 b blue
# 7 7 c green
# 8 8 a;c red;green
# 9 9 b;c blue;green
#10 10 c green
Besides case_when, you can also store the character-to-color mapping in a named vector and look the colors up from it:
map <- c(a = 'red', b = 'blue', c = 'green')
df %>%
separate_rows(Type) %>%
mutate(color = map[Type]) %>%
group_by(ID) %>%
summarise_all(~ paste0(., collapse = ";"))
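The named-vector lookup can also be sketched in base R, without reshaping to long format. This assumes the same three mappings as above; lookup is a name introduced here, and codes missing from it are kept unchanged, mirroring the TRUE ~ as.character(Type) fallback:

```r
df <- data.frame(ID = 1:10,
                 Type = c('a', 'a;b', 'b', 'a', 'b', 'b', 'c', 'a;c', 'b;c', 'c'),
                 stringsAsFactors = FALSE)

# Code-to-color mapping as a named vector
lookup <- c(a = 'red', b = 'blue', c = 'green')

# Split each Type on ";", translate every piece, and paste the pieces back together
df$color <- vapply(strsplit(df$Type, ";", fixed = TRUE), function(parts) {
  cols <- unname(lookup[parts])
  cols[is.na(cols)] <- parts[is.na(cols)]   # keep codes that have no mapping
  paste(cols, collapse = ";")
}, character(1))
print(df)
```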
