Group by and count in R

I am dealing with a dataset which is as follows
Id Date Color
10 2008-11-17 Red
10 2008-11-17 Red
10 2008-11-17 Blue
10 2010-01-26 Red
10 2010-01-26 Green
10 2010-01-26 Green
10 2010-01-26 Red
29 2007-07-31 Red
29 2007-07-31 Red
29 2007-07-31 Blue
29 2007-07-31 Green
29 2007-07-31 Red
My goal is to create a dataset like this
Color Representation Count Min Max
Red 1 + 1 + 1 = 3 2 + 2 + 3 = 7 2 3
Blue 1 + 1 = 2 1 + 1 = 2 1 1
Green 1 + 1 = 2 2 + 1 = 3 1 2
Representation
The value in the 1st row, 2nd column (Representation) is 3 because Red is represented three times, counting each unique combination of Id and Date once. For example, the 1st and 2nd rows share Id(10) and Date(2008-11-17), so this combination is represented once (1(10, 2008-11-17)). The 4th and 7th rows share Id(10) and Date(2010-01-26), so this combination is represented once (1(10, 2010-01-26)). The 8th, 9th and 12th rows share Id(29) and Date(2007-07-31), and similarly this combination is represented once (1(29, 2007-07-31)). Thus the value is 3 in row 1, column 2.
1(10, 2008-11-17) + 1(10, 2010-01-26) + 1(29, 2007-07-31) = 3
Count
The value in the 1st row, 3rd column (Count) is 7 because Red is mentioned twice by Id 10 on 2008-11-17 (2(10, 2008-11-17)), twice again by Id 10 on 2010-01-26 (2(10, 2010-01-26)), and three times by Id 29 on 2007-07-31 (3(29, 2007-07-31)).
2(10, 2008-11-17) + 2(10, 2010-01-26) + 3(29, 2007-07-31) = 7
Any help on accomplishing this unique frequency/count table is much appreciated.
Dataset
Id = c(10,10,10,10,10,10,10,29,29,29,29,29)
Date = c("2008-11-17", "2008-11-17", "2008-11-17","2010-01-26","2010-01-26","2010-01-26","2010-01-26",
"2007-07-31","2007-07-31","2007-07-31","2007-07-31","2007-07-31")
Color = c("Red", "Red", "Blue", "Red", "Green", "Green", "Red", "Red", "Red", "Blue", "Green", "Red")
df = data.frame(Id, Date, Color)

With dplyr:
library(dplyr)
df %>% group_by(Color) %>%
summarize(Representation = n_distinct(Id, Date), Count = n())
# # A tibble: 3 × 3
# Color Representation Count
# <fctr> <int> <int>
# 1 Blue 2 2
# 2 Green 2 3
# 3 Red 3 7

Another option is data.table
library(data.table)
setDT(df)[, .(Representation = uniqueN(paste(Id, Date)), Count = .N) , by = Color]
# Color Representation Count
#1: Red 3 7
#2: Blue 2 2
#3: Green 2 3
Update
For the second question, we can try
library(matrixStats)
m1 <- sapply(split(df[["Color"]], list(df$Id, df$Date), drop = TRUE), function(x) table(x))
v1 <- (NA^!m1) * m1
df1 <- data.frame(Color = row.names(m1), Representation = rowSums(m1!=0),
Count = rowSums(m1), Min = rowMins(v1, na.rm=TRUE),
Max = rowMaxs(v1, na.rm=TRUE))
row.names(df1) <- NULL
df1
# Color Representation Count Min Max
#1 Blue 2 2 1 1
#2 Green 2 3 1 2
#3 Red 3 7 2 3

You can use the aggregate() function:
# Make a new column joining Date and Id (the key the counts are based on)
df$DateId <- paste(df$Date, df$Id)
# Get the representation values
Representation <- aggregate(DateId ~ Color, data=df,FUN=function(x){length(unique(x))})
Representation
#> Color DateId
#> 1 Blue 2
#> 2 Green 2
#> 3 Red 3
# Get the Count values
Count <- aggregate(DateId ~ Color, data=df,FUN=length)
Count
#> Color DateId
#> 1 Blue 2
#> 2 Green 3
#> 3 Red 7
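The two aggregate() calls above stop at Representation and Count. One possible way to also get the Min and Max columns in base R is to first count rows per Id/Date/Color combination and then summarise those counts per color. This is only a sketch: the helper column n and the names per_combo, by_color and res are made up for illustration and do not come from the answers above.

```r
Id <- c(10, 10, 10, 10, 10, 10, 10, 29, 29, 29, 29, 29)
Date <- c("2008-11-17", "2008-11-17", "2008-11-17", "2010-01-26", "2010-01-26",
          "2010-01-26", "2010-01-26", "2007-07-31", "2007-07-31", "2007-07-31",
          "2007-07-31", "2007-07-31")
Color <- c("Red", "Red", "Blue", "Red", "Green", "Green", "Red",
           "Red", "Red", "Blue", "Green", "Red")
df <- data.frame(Id, Date, Color)

# Step 1: number of rows for every Id/Date/Color combination that occurs
# (n is a helper column of ones, added only for this sketch)
per_combo <- aggregate(n ~ Id + Date + Color, data = transform(df, n = 1L), FUN = sum)

# Step 2: summarise the per-combination counts for each color
by_color <- split(per_combo$n, per_combo$Color)
res <- data.frame(
  Color = names(by_color),
  Representation = lengths(by_color),   # distinct Id/Date pairs per color
  Count = sapply(by_color, sum),        # total rows per color
  Min = sapply(by_color, min),          # smallest per-pair count
  Max = sapply(by_color, max)           # largest per-pair count
)
rownames(res) <- NULL
res
```

The colors come out in alphabetical order because split() sorts the group labels.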

Related

Bind table rows only if a specific value doesn't exist, using R

Is it possible to bind rows only where specific values are missing?
In this example I have a table with four IDs and some values. Every ID is supposed to have a corresponding value of 1-3. As you can see, some of these values are missing in table dat. To fix this I want to bind dat with dat2, but only where values from the column "value" are missing from table dat. How can I achieve this?
To be clear, I only want 12 rows in total. So for instance, ID 4 has the value 3 and cat_var "green" in table dat. By contrast, in table dat2 ID 4 has the value 3 and cat_var "red". This means that I don't want to bind that row, since there already exists a row for ID 4 and value 3 in table dat. I hope I'm making myself clear.
library(tidyverse)
Data:
id <- c(rep(1:4,3))
value <- c(rep(1:3, each = 4))
dat <- data.frame(id, value)
dat2 <- dat
dat <- dat %>%
slice(1, 3, 5, 6, 7, 8, 10, 12)
dat2$cat_var <- c(rep("orange", 5), rep("green", 5), rep("red", 2))
dat$cat_var <- c(rep("orange", 3), rep("green", 5))
Desired result:
# A tibble: 12 x 3
id value cat_var
<int> <int> <chr>
1 1 1 orange
2 2 1 orange
3 3 1 orange
4 4 1 orange
5 1 2 orange
6 2 2 green
7 3 2 green
8 4 2 green
9 1 3 green
10 2 3 green
11 3 3 red
12 4 3 green
dat %>% bind_rows(dat2) %>% distinct(id, value, .keep_all = T) %>%
arrange(value, id)
results in :
id value cat_var
1 1 1 orange
2 2 1 orange
3 3 1 orange
4 4 1 orange
5 1 2 orange
6 2 2 green
7 3 2 green
8 4 2 green
9 1 3 green
10 2 3 green
11 3 3 red
12 4 3 green
You don't need the arrange (it is just there to get exactly the same data frame as the desired result).
Using base R:
Row-bind dat and dat2, then use duplicated to keep only the first row for each id/value pair.
result <- rbind(dat, dat2)
result <- result[!duplicated(result[, c('id', 'value')]), ]
result
# id value cat_var
#1 1 1 orange
#2 3 1 orange
#3 1 2 orange
#4 2 2 green
#5 3 2 green
#6 4 2 green
#7 2 3 green
#8 4 3 green
#10 2 1 orange
#12 4 1 orange
#17 1 3 green
#19 3 3 red
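The same result can also be reached by selecting the missing rows explicitly before binding, which makes the "only where missing" condition visible in the code. The following is a sketch in base R; key() is a hypothetical helper introduced here, and it assumes that pasting id and value with a space gives an unambiguous key.

```r
id <- rep(1:4, 3)
value <- rep(1:3, each = 4)
dat2 <- data.frame(id, value)
dat <- dat2[c(1, 3, 5, 6, 7, 8, 10, 12), ]
dat2$cat_var <- c(rep("orange", 5), rep("green", 5), rep("red", 2))
dat$cat_var <- c(rep("orange", 3), rep("green", 5))

# Hypothetical helper: one string key per row from the id/value pair
key <- function(d) paste(d$id, d$value)

# Rows of dat2 whose id/value pair does not occur in dat
missing_rows <- dat2[!key(dat2) %in% key(dat), ]

# Bind only those rows, then sort to match the desired layout
result <- rbind(dat, missing_rows)
result <- result[order(result$value, result$id), ]
rownames(result) <- NULL
result
```

Because rows of dat are never dropped, the cat_var values already present in dat (for example "green" for id 4, value 3) are preserved.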

R variable number of string concatenations within group_by

Let's say I have the following table of houses (or anything) and their colors:
I'm trying to:
group_by(Group)
count rows (I assume with length(unique(ID))),
mutate or summarize into a new row with a count of each color in group, as a string.
Result should be:
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?
Here is one approach: count the frequency of each 'Group'/'Color' pair with add_count, unite that count with 'Color' into 'nColor', then, grouped by 'Group', create the 'Summary' column by concatenating the group size (n()) with the unique elements of 'nColor'.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))
Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow
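For comparison, the same string can be built in base R with table() and ave(). This is a sketch only; summarise_group is a hypothetical helper, and it lists colors in alphabetical order (the order table() uses) rather than in order of first appearance, which happens to coincide for this data.

```r
df <- data.frame(ID = 1:6,
                 Group = c("a", "a", "a", "b", "b", "b"),
                 Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: build "<n> houses, <count> <color>, ..." for one group
summarise_group <- function(colors) {
  counts <- table(as.character(colors))  # as.character drops unused factor levels
  paste0(length(colors), " houses, ",
         paste(counts, names(counts), collapse = ", "))
}

# ave() applies the helper per group and recycles the single string to every row
df$Summary <- ave(df$Color, df$Group, FUN = summarise_group)
df
```

Since the helper loops over whatever colors table() finds, no color combination has to be spelled out by hand.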

R dataframes: how to make a new column that calculates values based on multiple other columns?

Let's say I have a dataframe with one column for colors and one column for shapes. I want to make a third column that is the number of total rows in the dataframe with that color/shape combination.
You could group by your columns and then add a column with the group size. This is easy in dplyr:
library(dplyr)
dat <- data.frame(col=c("red", "red", "red", "blue"), shape=c("oval", "oval", "circle", "circle"))
dat %>% group_by(col, shape) %>% mutate(ct=n()) %>% ungroup()
# # A tibble: 4 x 3
# col shape ct
# <fct> <fct> <int>
# 1 red oval 2
# 2 red oval 2
# 3 red circle 1
# 4 blue circle 1
If instead you wanted to collapse down all the duplicate rows into a single row with the corresponding count, then dat %>% count(col, shape), as suggested by @RonakShah in the comments, is the way to go.
You can use table to count combinations and use as.data.frame to show it as a data.frame.
as.data.frame(table(x))
# color shape Freq
#1 1 1 1
#2 2 1 0
#3 1 2 1
#4 2 2 2
Data:
(x <- data.frame(color=c(1,1,2,2), shape=c(1,2,2,2)))
# color shape
#1 1 1
#2 1 2
#3 2 2
#4 2 2
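If the count should be added as a column without loading any package, ave() can do the grouping. A minimal sketch, reusing the dat example from the dplyr answer above:

```r
dat <- data.frame(col = c("red", "red", "red", "blue"),
                  shape = c("oval", "oval", "circle", "circle"))

# For every row, the number of rows sharing its col/shape combination
dat$ct <- ave(seq_len(nrow(dat)), dat$col, dat$shape, FUN = length)
dat
```

The first argument only needs to be a numeric vector of the right length; ave() replaces each element with the size of its group.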

How to group and summarise each data frame in a list of data frames

I have a list of data frames:
df1 <- data.frame(one = c('red','blue','green','red','red','blue','green','green'),
one.1 = as.numeric(c('1','1','0','1','1','0','0','0')))
df2 <- data.frame(two = c('red','yellow','green','yellow','green','blue','blue','red'),
two.2 = as.numeric(c('0','1','1','0','0','0','1','1')))
df3 <- data.frame(three = c('yellow','yellow','green','green','green','white','blue','white'),
three.3 = as.numeric(c('1','0','0','1','1','0','0','1')))
all <- list(df1,df2,df3)
I need to group each data frame by the first column and summarise the second column.
Individually I would do something like this:
library(dplyr)
df1 <- df1 %>%
group_by(one) %>%
summarise(sum = sum(one.1))
However I'm having trouble figuring out how to iterate over each item in the list.
I've thought of using a loop:
for(i in 1:3){
all[i] <- all[i] %>%
group_by_at(1) %>%
summarise()
}
But I can't figure out how to specify a column to sum in the summarise() function (this loop is likely wrong in other ways than that anyway).
Ideally I need the output to be another list with each item being the summarised data, like so:
[[1]]
# A tibble: 3 x 2
one sum
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two sum
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three sum
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
Would really appreciate any help!
Using purrr::map, group by the first column and summarise the columns whose names contain a literal dot (\\.), selected with the matches helper.
library(dplyr)
library(purrr)
map(all, ~.x %>%
#group_by_at(vars(matches('one$|two$|three$'))) %>% #column ends with one, two, or three
group_by_at(1) %>%
summarise_at(vars(matches('\\.')),sum))
#summarise_at(vars(matches('\\.')),list(sum=~sum))) #2nd option
[[1]]
# A tibble: 3 x 2
one one.1
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two two.2
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three three.3
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
Here's a base R solution:
lapply(all, function(DF) aggregate(list(added = DF[, 2]), by = DF[, 1, drop = F], FUN = sum))
[[1]]
one added
1 blue 1
2 green 0
3 red 3
[[2]]
two added
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
three added
1 blue 0
2 green 2
3 white 1
4 yellow 1
Another approach would be to bind the data frames into one. Here I use data.table and bind by position rather than by name, so the combined table keeps the column names of the first data frame. The only problem is that this may mess up factors but I'm not sure that's an issue in your case.
library(data.table)
rbindlist(all, use.names = F, idcol = 'id'
)[, .(added = sum(one.1)), by = .(id, color = one)]
id color added
1: 1 red 3
2: 1 blue 1
3: 1 green 0
4: 2 red 1
5: 2 yellow 1
6: 2 green 1
7: 2 blue 1
8: 3 yellow 1
9: 3 green 2
10: 3 white 1
11: 3 blue 0
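The for loop from the question can also be repaired directly. The main changes in this sketch are indexing with [[i]] (which returns the data frame itself, not a one-element list) and referring to columns by position. The output list out and the column name sum are choices made for this illustration.

```r
df1 <- data.frame(one = c('red','blue','green','red','red','blue','green','green'),
                  one.1 = c(1, 1, 0, 1, 1, 0, 0, 0))
df2 <- data.frame(two = c('red','yellow','green','yellow','green','blue','blue','red'),
                  two.2 = c(0, 1, 1, 0, 0, 0, 1, 1))
df3 <- data.frame(three = c('yellow','yellow','green','green','green','white','blue','white'),
                  three.3 = c(1, 0, 0, 1, 1, 0, 0, 1))
all <- list(df1, df2, df3)

out <- vector("list", length(all))
for (i in seq_along(all)) {
  d <- all[[i]]                                  # [[ ]] extracts the data frame
  s <- aggregate(d[[2]], by = d[1], FUN = sum)   # sum column 2 within groups of column 1
  names(s)[2] <- "sum"                           # aggregate() names the result column "x"
  out[[i]] <- s
}
out
```

Using seq_along(all) instead of 1:3 keeps the loop correct if the list changes length.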

Filter by combination of (row) pairs

I have a dataframe in a long format and I want to filter pairs based on unique combinations of values. I have a dataset that looks like this:
id <- rep(1:4, each=2)
type <- c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
df <- data.frame(id,type)
df
id type
1 1 blue
2 1 blue
3 2 red
4 2 yellow
5 3 blue
6 3 red
7 4 red
8 4 yellow
Let's say each id is a respondent and type is a combination of treatments. Individual 1 saw two objects, both of them blue; individual 2 saw one red object and a yellow one; and so on.
How do I keep, for example, those that saw the combination "red" and "yellow"? If I filter by the combination "red" and "yellow" the resulting dataset should look like this:
id type
3 2 red
4 2 yellow
7 4 red
8 4 yellow
It should keep respondents number 2 and number 4 (only those that saw the combination "red" and "yellow"). Note that it does not keep respondent number 3 because she saw "blue" and "red" (instead of "red" and "yellow"). How do I do this?
One solution is to reshape the dataset into a wide format, filter it by column, and restack again. But I am sure there is another way to do it without reshaping the dataset. Any idea?
A dplyr solution would be:
library(dplyr)
df <- tibble(
id = rep(1:4, each = 2),
type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
)
types <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(types %in% type))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
Update
Allowing for the equal combinations, e.g. blue, blue, we have to change the filter-call to the following:
types2 <- c("blue", "blue")
df %>%
group_by(id) %>%
filter(sum(types2 == type) == length(types2))
#> # A tibble: 2 x 2
#> # Groups: id [1]
#> id type
#> <int> <chr>
#> 1 1 blue
#> 2 1 blue
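This solution also allows different types, provided each group's rows appear in the same order as types, since == compares the two vectors position by position: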
df %>%
group_by(id) %>%
filter(sum(types == type) == length(types))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
Let's use all() to check whether every value in the filter set occurs within each group.
library(tidyverse)
test_filter <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(test_filter %in% type))
# A tibble: 4 x 2
# Groups: id [2]
id type
<int> <fctr>
1 2 red
2 2 yellow
3 4 red
4 4 yellow
I modified your data and did the following.
df <- data.frame(id = rep(1:4, each=3),
type = c("blue", "blue", "green", "red", "yellow", "purple",
"blue", "orange", "yellow", "yellow", "pink", "red"),
stringsAsFactors = FALSE)
id type
1 1 blue
2 1 blue
3 1 green
4 2 red
5 2 yellow
6 2 purple
7 3 blue
8 3 orange
9 3 yellow
10 4 yellow
11 4 pink
12 4 red
As you see, there are three observations for each id. id 2 and 4 have both red and yellow. They also have non-target colors (i.e., purple, and pink). I wanted to preserve these observations. In order to achieve this task, I wrote the following code. The code can be read like this. "For each id, check if there is any red and yellow using any(). When both conditions are TRUE, keep all rows for the id."
group_by(df, id) %>%
filter(any(type == "yellow") & any(type == "red"))
id type
4 2 red
5 2 yellow
6 2 purple
10 4 yellow
11 4 pink
12 4 red
Using data.table:
library(data.table)
setDT(df)
df[, type1 := shift(type, type = "lag"), by = id]
df1 <- df[type == "yellow" & type1 == "red", id]
df <- df[id %in% df1, ]
df[, type1 := NULL]
It gives:
id type
1: 2 red
2: 2 yellow
3: 4 red
4: 4 yellow
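If the combination should match exactly, regardless of row order and including repeated values such as "blue", "blue", one option is to compare the sorted values of each group with the sorted target. This is a base R sketch; has_combo and filter_combo are hypothetical helpers written for this example.

```r
id <- rep(1:4, each = 2)
type <- c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
df <- data.frame(id, type, stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when a group's values equal the target combination,
# ignoring order but respecting how often each value occurs
has_combo <- function(x, combo) identical(sort(as.character(x)), sort(combo))

# Hypothetical helper: keep all rows of the ids whose group matches combo
filter_combo <- function(d, combo) {
  ok <- sapply(split(d$type, d$id), has_combo, combo = combo)
  d[d$id %in% names(ok)[ok], ]
}

res_red_yellow <- filter_combo(df, c("red", "yellow"))
res_blue_blue <- filter_combo(df, c("blue", "blue"))
res_red_yellow
```

Unlike the any() based filter above, this drops groups that contain additional values beyond the target combination.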
