Summarize occurrences by area and then by custom groups - r

I have the code below, which takes a two-column dataset and creates age group categories depending on the stated CustomerAge.
library(tidyverse)
df <-
read.table(textConnection("Area CustomerAge
A 28
A 40
A 70
A 19
B 13
B 12
B 72
B 90"), header=TRUE)
df2 <- df %>%
mutate(
# Create categories
Customer_Age_Group = dplyr::case_when(
CustomerAge <= 18 ~ "0-18",
CustomerAge > 18 & CustomerAge <= 60 ~ "19-60",
CustomerAge > 60 ~ ">60"
))
What I am looking to achieve is an output summary that looks like the one below:
Area  Customer_Age_Group  Occurrences
A     0-18                0
A     19-60               3
A     >60                 1
B     0-18                2
B     19-60               0
B     >60                 2

To also include groups with 0 occurrences, you need count(), ungroup() and complete():
df2 %>% group_by(Area, Customer_Age_Group,.drop = FALSE) %>%
count() %>%
ungroup() %>%
complete(Area, Customer_Age_Group, fill=list(n=0))
This will also show groups with 0 occurrences.
To sort by Area and age group:
df2 %>% group_by(Area, Customer_Age_Group,.drop = FALSE) %>%
count() %>%
ungroup() %>%
complete(Area, Customer_Age_Group, fill=list(n=0)) %>%
arrange(Area, parse_number(Customer_Age_Group))
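An alternative sketch, assuming it is acceptable to store Customer_Age_Group as a factor: with explicit levels, count(.drop = FALSE) keeps the empty groups and the factor order takes care of the sorting, so complete() and parse_number() are not needed. The level order in age_levels and the object name summary_tbl are assumptions made for this illustration.

```r
library(dplyr)

df <- data.frame(
  Area = rep(c("A", "B"), each = 4),
  CustomerAge = c(28, 40, 70, 19, 13, 12, 72, 90)
)

# Assumed display order of the age groups; adjust as needed
age_levels <- c("0-18", "19-60", ">60")

summary_tbl <- df %>%
  mutate(
    Customer_Age_Group = case_when(
      CustomerAge <= 18 ~ "0-18",
      CustomerAge <= 60 ~ "19-60",
      TRUE ~ ">60"
    ),
    # A factor with explicit levels records groups that have no rows
    Customer_Age_Group = factor(Customer_Age_Group, levels = age_levels)
  ) %>%
  # .drop = FALSE keeps factor levels with zero rows within each Area
  count(Area, Customer_Age_Group, .drop = FALSE, name = "Occurrences")

summary_tbl
```

Note that only the factor column is expanded to its unused levels; an Area value that never appears in the data would still be missing from the result.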

group_by and summarise are what you're looking for.
df2 %>% group_by(Area, Customer_Age_Group) %>% summarise(Occurrences = n())
However, note that this won't show categories with zero occurrences in your data set.

Related

How can I check if unique values of a column has multiple occurrences for different values of another column in R?

Sample data:
set.seed(4)
cl <- sample(LETTERS[1:2], 100, replace = TRUE)
seller <- round(runif(100, min=1, max=80))
df <- data.frame(cl, seller)
cl seller
1 B 21
2 A 51
3 A 22
4 A 43
5 A 38
6 B 46
7 A 54
8 B 18
9 A 78
.......
99 A 32
100 B 8
I want to check the number of times one unique value of seller has occurred for both A and B. Suppose, in the data frame with this particular seed, you'll see 7 has appeared for both A and B, so 7 will be counted.
My attempt:
df %>%
filter(cl=='A')-> d1
df %>%
filter(cl=='B')-> d2
d3 <- merge(d1, d2, by='seller') %>%
distinct(seller)
nrow(d3)
17
So, 17 sellers have both cl: A and B.
My attempt, so far, has been very sub-optimal. It yields the result, but there has to be a better way with dplyr or even with base R which I can't figure out. Also, it will be very time-consuming for a bigger data set if I do it like this.
How can I solve this in a better, more tidy way?
We could use n_distinct (assuming only 'A', 'B' values found in 'cl' column):
library(dplyr)
df %>%
group_by(seller) %>%
summarise(n = n_distinct(cl), .groups = 'drop') %>%
filter(n == 2) %>%
nrow
Output:
[1] 17
Or we may also do:
df %>%
group_by(seller) %>%
summarise(n = all(c("A", "B") %in% cl), .groups = 'drop') %>%
pull(n) %>%
sum
[1] 17
A base R approach using table, colSums and sum
sum(colSums(table(df) > 0) == 2)
#[1] 17
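Another base R sketch of the same idea uses intersect() on the unique sellers of each class. It uses a small fixed data set (hypothetical values, not the seeded sample above) so the result does not depend on the random number generator; df_small, sellers_a, sellers_b and both are names introduced only for this example.

```r
# Small, fixed example data (hypothetical values)
df_small <- data.frame(
  cl = c("A", "B", "A", "B", "A", "B"),
  seller = c(1, 1, 2, 3, 4, 4)
)

# Unique sellers seen in each class
sellers_a <- unique(df_small$seller[df_small$cl == "A"])
sellers_b <- unique(df_small$seller[df_small$cl == "B"])

# Sellers that appear under both A and B
both <- intersect(sellers_a, sellers_b)
length(both)
```

Here sellers 1 and 4 appear under both classes, so the count is 2.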

Performing operations on dplyr summaries

Assume we have some random data:
data <- data.frame(ID = rep(seq(1:3),3),
Var = sample(1:9, 9))
we can compute summarizing operations using dplyr, like this:
library(dplyr)
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var))
which gives output that looks like this below an r markdown chunk:
ID count
1 3
2 3
3 3
I would like to know how we can perform operations on individual data points in this dplyr output without saving the output in a separate object.
For example, in the output of summarise, let's say we wanted to subtract the output value for ID == 3 from the sum of the output values for ID == 1 and ID == 2, and leave the output values for ID == 1 and ID == 2 as they are. The only way I know to do this is to save the summary output in another object and perform the operation on that object, like this:
a<-
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var))
a
#now perform the operation on a
a[3,2] <- a[2,1]+a[2,2]-1
a
a now looks like this:
ID count
1 3
2 3
3 4
Is there a way to do this in dplyr output without making new objects? Can we somehow use mutate directly on output like this?
We can add a mutate after the summarise and use replace to modify the value at the position given by n() (the last row):
library(dplyr)
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var)) %>%
mutate(count = replace(count, n(), count[2] + ID[2] - 1))
Output:
# A tibble: 3 x 2
ID count
<int> <dbl>
1 1 3
2 2 3
3 3 4
Or if there are more than two columns, use sum on the sliced row
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var)) %>%
mutate(count = replace(count, n(), sum(cur_data() %>%
slice(2)) - 1))
Alternative that does what you say you want ("sum others") but not what you demonstrate.
data %>%
group_by(ID) %>%
summarize(count = n_distinct(Var)) %>%
mutate(count = if_else(ID == 3L, sum(count) - count, count))
# # A tibble: 3 x 2
# ID count
# <int> <int>
# 1 1 3
# 2 2 3
# 3 3 6
or, if there are other IDs that should not be included in the sum, then
data %>%
group_by(ID) %>%
summarize(count = n_distinct(Var)) %>%
mutate(count = if_else(ID == 3L, sum(count[ID %in% 1:2]), count))

Filter data by group & preserve empty groups

How can I filter my data by group and preserve the groups that are empty?
Example:
year = c(1,2,3,1,2,3,1,2,3)
site = rep(c("a", "b", "d"), each = 3)
value = c(3,3,0,1,8,5,10,18,27)
df <- data.frame(year, site, value)
I want to subset the rows where the value is more than 5. For some groups, this is never true. Filter function simply skips empty groups.
How can I keep my empty groups and have NA instead? Ideally, I would like to use dplyr functions instead of base R.
My filtering approach, where .preserve does not preserve empty groups:
df %>%
group_by(site) %>%
filter(value > 5, .preserve = TRUE)
Expected output:
year site value
<dbl> <fct> <dbl>
1 NA a NA
2 2 b 8
3 1 d 10
4 2 d 18
5 3 d 27
With the addition of tidyr, you can do:
df %>%
group_by(site) %>%
filter(value > 5) %>%
ungroup() %>%
complete(site = df$site)
site year value
<fct> <dbl> <dbl>
1 a NA NA
2 b 2 8
3 d 1 10
4 d 2 18
5 d 3 27
Or if you want to keep it in dplyr:
df %>%
group_by(site) %>%
filter(value > 5) %>%
bind_rows(df %>%
group_by(site) %>%
filter(all(value <= 5)) %>%
summarise_all(~ NA))
Using the nesting functionality of tidyr and applying purrr::map:
df %>%
group_by(site) %>%
tidyr::nest() %>%
mutate(data = purrr::map(data, . %>% filter(value > 5))) %>%
tidyr::unnest(cols=c(data), keep_empty = TRUE)
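A further dplyr-only sketch, assuming a join is acceptable: filter first, then right_join() back onto the distinct sites so that any site with no surviving rows reappears with NA in the other columns. The name kept is introduced only for this illustration.

```r
library(dplyr)

df <- data.frame(
  year = rep(1:3, times = 3),
  site = rep(c("a", "b", "d"), each = 3),
  value = c(3, 3, 0, 1, 8, 5, 10, 18, 27)
)

kept <- df %>%
  filter(value > 5) %>%
  # Sites with no remaining rows come back with NA in year and value
  right_join(distinct(df, site), by = "site") %>%
  arrange(site)

kept
```

Because the join key is taken from the unfiltered data, this does not need grouping at all.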

Find duplicate values and have references

My data = data.lab
data.lab <- data.frame(Name=c("A","e","b","c","d"),
bp =c( 12,12,11,12,11),
sugar = c(19,21,23,19,23))
I want to keep only the rows that are duplicated on bp and sugar, together with a reference column.
Desired output:
lab.data <- data.frame(Name=c("A","b","c","d"),
bp =c( 12,11,12,11),
sugar = c(19,23,19,23),
pair=c(1,1,2,2))
dub.data <- duplicated(data.lab[c("bp", "sugar")]) | duplicated(data.lab[c("bp", "sugar")], fromLast = TRUE)
out.1 <- data.lab[dub.data, ]
This gives the duplicated rows, but I also need a column that shows which rows form the duplicate pairs.
With dplyr, you can do:
data.lab %>%
group_by(bp, sugar) %>%
filter(n() == 2) %>%
mutate(pair = seq_along(Name))
Name bp sugar pair
<fct> <dbl> <dbl> <int>
1 A 12 19 1
2 b 11 23 1
3 c 12 19 2
4 d 11 23 2
Or:
data.lab %>%
group_by(bp, sugar) %>%
filter(n() == 2) %>%
mutate(pair = row_number())
Or if a group could contain more than two duplicated rows:
data.lab %>%
group_by(bp, sugar) %>%
filter(n() > 1) %>%
mutate(pair = seq_along(Name))
Or:
data.lab %>%
group_by(bp, sugar) %>%
filter(n() > 1) %>%
mutate(pair = row_number())
Or to group by all variables except "Name":
data.lab %>%
group_by_at(vars(-matches("(Name)"))) %>%
filter(n() > 1) %>%
mutate(pair = seq_along(Name))
Or:
data.lab %>%
group_by_at(vars(-matches("(Name)"))) %>%
filter(n() > 1) %>%
mutate(pair = row_number())
Continuing from your approach, we can use ave in base R:
dat1 <- data.lab[duplicated(data.lab[c("bp", "sugar")]) |
duplicated(data.lab[c("bp", "sugar")], fromLast = TRUE) , ]
dat1$pair <- with(dat1, ave(seq_along(Name), bp, sugar, FUN = seq_along))
dat1
# Name bp sugar pair
#1 A 12 19 1
#2 b 11 23 1
#3 c 12 19 2
#4 d 11 23 2
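The answers above number the rows within each duplicated (bp, sugar) combination. If what is wanted instead is one shared identifier per set of duplicates, so that matching rows carry the same value, a sketch with dplyr's cur_group_id() could look like this. It assumes dplyr 1.0.0 or later; the column name group_id and the object name labelled are hypothetical.

```r
library(dplyr)

data.lab <- data.frame(
  Name = c("A", "e", "b", "c", "d"),
  bp = c(12, 12, 11, 12, 11),
  sugar = c(19, 21, 23, 19, 23)
)

labelled <- data.lab %>%
  group_by(bp, sugar) %>%
  # Keep only combinations that occur more than once
  filter(n() > 1) %>%
  # Rows with the same bp and sugar share one identifier
  mutate(group_id = cur_group_id()) %>%
  ungroup()

labelled
```

With this data, A and c end up with one identifier and b and d with another.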

R Sum numbers within string

I have a question:
I have a dataset like this simple example:
df<-data.frame(ID=c("A","B","C","D"),
Score=c("15","16/18/19+2/6","3/+2","19/18/14"))
I want to end up with a dataset in which the score numbers are split. I have a problem with the /+2 part: when it says "3/+2", it actually means "3/3+2", which would finally give "3/5". So what I would like some help with is ending up with a dataset like this:
ID Score
A 15
B 16/18/19/21/6
C 3/5
D 19/18/14
I already found out that I can then separate the score by
df<-df %>%
mutate(Score = strsplit(as.character(Score), "/")) %>%
unnest(Score)
But I don't know how I can let the numbers duplicate and then sum when /+ occurs, could someone help me?
It could be probably solved in a more elegant way, but here is one possibility:
df %>%
mutate(Score = strsplit(as.character(Score), "/")) %>%
unnest(Score) %>%
rowwise() %>%
mutate(Score = eval(parse(text = paste0(Score)))) %>%
group_by(ID) %>%
mutate(Score = paste0(Score, collapse = "/")) %>%
distinct()
ID Score
<fct> <chr>
1 A 15
2 B 16/18/21/6
3 C 3/5
4 D 19/18/14
Sample data:
df <- data.frame(ID=c("A","B","C","D"),
Score=c("15","16/18/19+2/6","3/3+2","19/18/14"))
It splits "Score" based on /, converts characters to expression by parse() and then transforms it back.
Using the data you provided and the pattern from @A. Suliman:
df %>%
mutate(Score = strsplit(gsub("(\\d+)/*\\+(\\d+)","\\1/\\1+\\2", Score), "/")) %>%
unnest(Score) %>%
rowwise() %>%
mutate(Score = eval(parse(text = paste0(Score)))) %>%
group_by(ID) %>%
mutate(Score = paste0(Score, collapse = "/")) %>%
distinct()
ID Score
<fct> <chr>
1 A 15
2 B 16/18/19/21/6
3 C 3/5
4 D 19/18/14
We can use gsubfn to do this in a compact way
library(gsubfn)
library(tidyverse)
df %>%
mutate(Score = gsubfn("\\d+\\+\\d+", ~ eval(parse(text = x)), Score))
# ID Score
#1 A 15
#2 B 16/18/21/6
#3 C 3/5
#4 D 19/18/14
data
df <- data.frame(ID=c("A","B","C","D"),
Score=c("15","16/18/19+2/6","3/3+2","19/18/14"), stringsAsFactors = FALSE)
library(dplyr)
library(tidyr) #separate_rows, no need for unnest
df %>% rowwise()%>%
mutate(Score_upd=paste0(sapply(unlist(strsplit(gsub('(\\d+)/*\\+(\\d+)','\\1/\\1+\\2',Score),'/')),
function(x)eval(parse(text = x))),collapse = '/')) %>%
separate_rows(Score_upd,sep = '/')
#short version
df %>% mutate(Score=gsub('(\\d+)/*\\+(\\d+)','\\1/\\1+\\2',Score)) %>%
separate_rows(Score,sep='/') %>% rowwise() %>% mutate(Score=eval(parse(text=Score))) %>%
group_by(ID) %>% summarise(Score=paste0(Score,collapse = '/'))
# A tibble: 4 x 2
ID Score
<fct> <chr>
1 A 15
2 B 16/18/19/21/6
3 C 3/5
4 D 19/18/14
The main idea is using gsub to separate 2+3 correctly, e.g:
gsub('(\\d+)/*\\+(\\d+)','\\1/\\1+\\2','20/8/2+3') #/* means zero or more occurrences of /, which covers both 19+2 and 3/+2.
[1] "20/8/2/2+3"
Then
valid_str <- gsub('(\\d+)/*\\+(\\d+)','\\1/\\1+\\2','20/8/2+3')
sapply(unlist(strsplit(valid_str,'/')),function(x) eval(parse(text=x)))
20 8 2 2+3
20 8 2 5
#OR
sapply(unlist(strsplit(valid_str,'/')),function(x) sum(as.numeric(unlist(strsplit(x,'\\+')))))
20 8 2 2+3
20 8 2 5
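Putting the two steps together, here is a base R sketch that avoids eval(parse()) entirely by summing the pieces split on "+". The helper name expand_score is hypothetical; it reuses the gsub pattern shown above and assumes scores contain only digits, "/" and "+".

```r
df <- data.frame(
  ID = c("A", "B", "C", "D"),
  Score = c("15", "16/18/19+2/6", "3/+2", "19/18/14"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rewrite "a+b" and "a/+b" as "a/a+b", then sum each part
expand_score <- function(score) {
  fixed <- gsub("(\\d+)/*\\+(\\d+)", "\\1/\\1+\\2", score)
  parts <- strsplit(fixed, "/", fixed = TRUE)[[1]]
  # Each part is either a single number or "x+y"; sum its pieces
  totals <- vapply(
    strsplit(parts, "+", fixed = TRUE),
    function(p) sum(as.numeric(p)),
    numeric(1)
  )
  paste(totals, collapse = "/")
}

df$Score <- vapply(as.character(df$Score), expand_score, character(1),
                   USE.NAMES = FALSE)
df
```

Summing numeric pieces rather than evaluating the text as R code means unexpected input produces NA with a warning instead of running arbitrary expressions.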

Resources