How to count the cumulative number of subgroupings using dplyr?

I'm trying to compute a running count of subgroupings using dplyr, as illustrated and explained in the image below. I am trying to solve for Flag2 in the image. Any recommendations for how to do this?
Beneath the image I also have reproducible code that computes all columns up through Flag1, which works fine.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
Group = c(0,0,1,1,2,2,0,3,3,0,0,0)
)
excelCopy <- myData %>%
group_by(Element) %>%
mutate(Element_Count = row_number()) %>%
mutate(Flag1 = case_when(Group > 0 ~ match(Group, unique(Group)),TRUE ~ Element_Count)) %>%
ungroup()
print.data.frame(excelCopy)

Use row_number() within each Element/Group pair and set the rows where Group is 0 to NA:
library(dplyr)
excelCopy |>
group_by(Element, Group) |>
mutate(Flag2 = ifelse(Group == 0, NA, row_number()))
Element Group Element_Count Flag1 Flag2
<chr> <dbl> <int> <int> <int>
1 A 0 1 1 NA
2 B 0 1 1 NA
3 B 1 2 2 1
4 B 1 3 2 2
5 B 2 4 3 1
6 B 2 5 3 2
7 A 0 2 2 NA
8 C 3 1 1 1
9 C 3 2 1 2
10 C 0 3 3 NA
11 C 0 4 4 NA
12 C 0 5 5 NA
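A type-stable variant of the same idea can be sketched as follows. It assumes the myData frame from the question and swaps in if_else() with NA_integer_ so that Flag2 is always an integer column; the name flagged is only illustrative.

```r
library(dplyr)

# Data from the question
myData <- data.frame(
  Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
  Group   = c(0,0,1,1,2,2,0,3,3,0,0,0)
)

flagged <- myData %>%
  group_by(Element, Group) %>%
  # Count rows within each Element/Group pair; rows with Group == 0 get NA
  mutate(Flag2 = if_else(Group == 0, NA_integer_, row_number())) %>%
  ungroup()

print.data.frame(flagged)
```

Note that row_number() counts every row of an Element/Group pair, even if the rows are not adjacent in the data.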

Related

R total multiple columns at once with n()

I don't know if I am missing something very obvious here or not, but I am having trouble getting the desired results format for a count. These are all yes, no or NA answers to a question.
My data looks a bit like:
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
df %>%
group_by(A, B, C)%>%
summarise(count = n())
I have also tried
count(A, B, C)
with exactly the same results.
I want to count the total number of 0, 1 and NA responses for each column: (rows and columns are interchangeable here, it's the count of response v column format of the table that I'm after.)
Response 0 1 NA
Column A 4 2 1
Column B 3 2 2
Column C 1 3 3
What I am getting instead is
A B C n
0 0 1 1
0 1 0 1
0 1 NA 1
0 NA 1 1
1 0 NA 2
NA NA 1 1
In other words, it's counting the number of times each unique combination of ABC appears. How do I get it to focus on counting the columns and not the rows?
You can apply the table() function across the columns:
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
t(apply(df, 2,table, useNA = "always"))
#> 0 1 <NA>
#> A 4 2 1
#> B 3 2 2
#> C 1 3 3
Created on 2022-08-05 by the reprex package (v2.0.1)
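One caveat with apply(df, 2, table) is that it only returns a tidy matrix when every column contains the same set of values. A minimal sketch that fixes the levels up front, assuming the only valid responses are 0 and 1 (the name counts is illustrative):

```r
df <- read.table(text = "A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA", header = TRUE)

# Convert each column to a factor with fixed levels so every table has
# the same three cells (0, 1, NA), even if a value never occurs in a column.
counts <- t(sapply(df, function(col) {
  table(factor(col, levels = c(0, 1)), useNA = "always")
}))
counts
```

With fixed levels, a column that happens to contain no 0s (or no 1s) still contributes a zero count instead of changing the shape of the result.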
One alternate tidyverse solution would be the following:
library(tidyverse)
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
x <- df %>%
mutate(across(everything(), ~fct_explicit_na(as.factor(.x),"NA"))) %>%
map(., ~c(table(.x))) %>%
bind_rows(.id = 'Response')
x
#> # A tibble: 3 × 4
#> Response `0` `1` `NA`
#> <chr> <int> <int> <int>
#> 1 A 4 2 1
#> 2 B 3 2 2
#> 3 C 1 3 3
Created on 2022-08-05 by the reprex package (v2.0.1)
I guess it might need some data re-shaping if you want to use dplyr::n().
First transform df into a "long" format, which gives a two-column data frame. Then group by every column (group_by_all()) and apply summarize(n = n()). Finally, transform the result back to a "wide" format.
library(tidyverse)
df %>% pivot_longer(everything(), names_to = "Response") %>%
group_by_all() %>%
summarize(n = n()) %>%
pivot_wider(names_from = "value", values_from = "n")
# A tibble: 3 × 4
# Groups: Response [3]
Response `0` `1` `NA`
<chr> <int> <int> <int>
1 A 4 2 1
2 B 3 2 2
3 C 1 3 3
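The group-and-summarise step can also be written with dplyr::count(), which combines the grouping and the n() call. This is a sketch of the same reshaping idea, not a different method; the name wide is illustrative.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA", header = TRUE)

wide <- df %>%
  pivot_longer(everything(), names_to = "Response") %>%  # columns: Response, value
  count(Response, value) %>%                             # one row per Response/value pair
  pivot_wider(names_from = value, values_from = n)       # one column per response value
wide
```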
Using table and stack you can try:
t(table(stack(df), useNA = "ifany"))
Output
values
ind 0 1 <NA>
A 4 2 1
B 3 2 2
C 1 3 3
If you find yourself wanting to apply the same operation to multiple columns in your data it could be a hint that you should reshape your data to a "longer" format, such that each row represents a single observation. Once your data is in this format you can use table() to get the summary you're after:
df_tidy <-
df %>%
pivot_longer(cols = everything(), names_to = "group", values_to = "response")
print(df_tidy)
#> # A tibble: 21 x 2
#> group response
#> <chr> <int>
#> 1 A 0
#> 2 B NA
#> 3 C 1
#> 4 A 1
#> 5 B 0
#> 6 C NA
#> 7 A 0
#> 8 B 1
#> 9 C 0
#> 10 A NA
#> # … with 11 more rows
table(df_tidy, useNA = "ifany")
#> response
#> group 0 1 <NA>
#> A 4 2 1
#> B 3 2 2
#> C 1 3 3

Return the column name of the second largest value of a row

df = data.frame( ID = c (1,2,3,4,5), a = c (0,2,0,1,0),
b = c (0,3,2,NA,0), c = c(0,4,NA,NA,1),
d = c (2,5,4,NA,1))
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
df<-df %>% mutate( second_largest=apply(.[2:5], 1, function(x) names(x)[maxn(2)(x)]) )
I used the R code above to obtain the column name of the second largest value among a, b, c, d. For ID=4, because b, c, and d are missing, the name of the second largest value should be NA. However, the code returns b. How should I handle the missing values?
One more approach:
df = data.frame( ID = c (1,2,3,4,5), a = c (0,2,0,1,0),
b = c (0,3,2,NA,0), c = c(0,4,NA,NA,1),
d = c (2,5,4,NA,1))
library(dplyr, warn.conflicts = F)
df %>% group_by(ID) %>% rowwise() %>%
mutate(name = {x <- c_across(everything());
if (sum(!is.na(x)) >= 2) tail(head(names(cur_data())[order(x, decreasing = T)],2),1) else NA})
#> # A tibble: 5 x 6
#> # Rowwise: ID
#> ID a b c d name
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1 0 0 0 2 a
#> 2 2 2 3 4 5 c
#> 3 3 0 2 NA 4 b
#> 4 4 1 NA NA NA <NA>
#> 5 5 0 0 1 1 d
If you only need to do it for a few columns instead:
df %>% group_by(ID) %>% rowwise() %>%
mutate(name = {x <- c_across(c('a', 'c'));
if (sum(!is.na(x)) >= 2) tail(head(c('a', 'c')[order(x, decreasing = T)],2),1) else NA})
# A tibble: 5 x 6
# Rowwise: ID
ID a b c d name
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 0 0 2 c
2 2 2 3 4 5 a
3 3 0 2 NA 4 NA
4 4 1 NA NA NA NA
5 5 0 0 1 1 a
I think you can use the following solution. I tested some possible configurations of numbers and it worked:
library(dplyr)
library(purrr)
df %>%
mutate(Name = pmap_chr(., ~ {x <- c(...)[-1];
if(sum(is.na(x)) >= 3) {
NA
} else {
ind <- which(x == max(x[!is.na(x)]))
if(length(ind) > 1) {
colnames(df[-1])[ind[2]]
} else {
colnames(df[-1])[which(x == sort(x)[length(sort(x))-1])][1]
}
}
}
))
ID a b c d Name
1 1 0 0 0 2 a
2 2 2 3 4 5 c
3 3 0 2 NA 4 b
4 4 1 NA NA NA <NA>
5 5 0 0 1 1 d
We can change the function to -
# na.last = NA makes order() drop missing values instead of placing them last
maxn <- function(n) function(x) order(x, decreasing = TRUE, na.last = NA)[n]
The code will then work with your approach -
library(dplyr)
df %>%
mutate(second_largest=apply(.[2:5], 1, function(x) names(x)[maxn(2)(x)]))
# ID a b c d second_largest
#1 1 0 0 0 2 a
#2 2 2 3 4 5 c
#3 3 0 2 NA 4 b
#4 4 1 NA NA NA <NA>
#5 5 0 0 1 1 d
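When checking an NA-aware helper like this, it is worth trying rows where the missing value comes first, since the example data only has NAs in later columns. A small sketch, where nth_largest_name is a hypothetical helper name and order()'s na.last = NA argument is used to drop missing values:

```r
# Hypothetical helper: name of the n-th largest non-missing value
# in a named numeric vector, or NA if fewer than n values are present.
nth_largest_name <- function(x, n) {
  idx <- order(x, decreasing = TRUE, na.last = NA)  # NAs are removed from the ordering
  names(x)[idx[n]]                                  # idx[n] is NA when n > length(idx)
}

r1 <- nth_largest_name(c(a = NA, b = 5, c = 3, d = NA), 2)   # "c"
r2 <- nth_largest_name(c(a = 1, b = NA, c = NA, d = NA), 2)  # NA
r1
r2
```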

R - Mutate column based on another column

Using R:
For the dataframe:
A<-c(3,3,3,3,1,1,2,2,2,2,2)
df<-data.frame(A)
How do you add a column such that the output is the same as:
A<-c(3,3,3,3,1,1,2,2,2,2,2)
df<-data.frame(A)
B<-c(1,1,1,0,1,0,1,1,0,0,0)
mutate(df,B)
In other words, is there a formula for column 'B' that looks at column 'A' and lists '1' as many times as the value of 'A' (three times for the first run), then puts '0' for the remaining rows of that run, and so on?
So the desired output (given column 'A') is the column 'B' above.
Thank you.
Here I assign a new group each time A changes, then within each group put a 1 in B in the first #A rows.
(If the values of A are distinct for each group, you could replace the first two lines with group_by(A), but unclear if that's a fair assumption.)
library(dplyr)
df %>%
mutate(group = cumsum(A != lag(A, default = 0))) %>%
group_by(group) %>%
mutate(B = 1 * (row_number() <= A)) %>%
ungroup()
result
# A tibble: 11 x 3
A group B
<dbl> <int> <dbl>
1 3 1 1
2 3 1 1
3 3 1 1
4 3 1 0
5 1 2 1
6 1 2 0
7 2 3 1
8 2 3 1
9 2 3 0
10 2 3 0
11 2 3 0
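To see how the run identifier is built, the cumsum() step can be looked at on its own. This sketch assumes, like the code above, that A never starts with 0, because default = 0 is what makes the first comparison TRUE; run_id is an illustrative name.

```r
library(dplyr)

A <- c(3,3,3,3,1,1,2,2,2,2,2)

# TRUE wherever A differs from the previous value; the cumulative sum of
# those TRUEs increases by one at the start of every new run.
run_id <- cumsum(A != lag(A, default = 0))
run_id
```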
After grouping by the runs of 'A', use rep() to repeat 1 as many times as the value of 'A' and 0 for the remaining rows of the group (the number of rows minus that value).
library(dplyr)
library(data.table)
df %>%
group_by(A, grp = rleid(A)) %>%
mutate(B = rep(c(1, 0), c(first(A), n() - first(A)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 11 x 2
# A B
# <dbl> <dbl>
# 1 3 1
# 2 3 1
# 3 3 1
# 4 3 0
# 5 1 1
# 6 1 0
# 7 2 1
# 8 2 1
# 9 2 0
#10 2 0
#11 2 0
Or using rle from base R
# rbind() interleaves the counts so that each run gets its 1s followed by its 0s
with(rle(df$A), rep(rep(c(1, 0), length(values)), c(rbind(values, lengths - values))))
#[1] 1 1 1 0 1 0 1 1 0 0 0
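Another base R sketch of the same idea compares each row's position within its run to the value of A. It uses sequence() on the run lengths; the names runs and B are illustrative.

```r
A <- c(3,3,3,3,1,1,2,2,2,2,2)

runs <- rle(A)
# sequence() restarts the counter at 1 for every run: 1 2 3 4 1 2 1 2 3 4 5
pos_in_run <- sequence(runs$lengths)
# A row gets 1 while its position within the run does not exceed A
B <- as.integer(pos_in_run <= A)
B
```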

which.max() by groups but output in the dataframe

There is this data frame given by (an example):
df <- read.table(header = TRUE, text = 'Group Utility
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
I want to use any command (I have been trying iterations of which.max() to no avail) to get an additional column in the dataset, say Choice, that indicates whether Utility is the maximum within the group given by Group. The table would look like:
Group Utility Choice
A 12 1
A 10 0
B 3 0
B 5 0
B 6 1
C 1 1
D 3 0
D 4 1
You can try this with dplyr
library(dplyr)
df %>%
group_by(Group) %>%
mutate(Choice = ifelse(Utility == max(Utility), 1, 0)) %>%
ungroup()
Output
# A tibble: 8 x 3
Group Utility Choice
<fct> <int> <dbl>
1 A 12 1
2 A 10 0
3 B 3 0
4 B 5 0
5 B 6 1
6 C 1 1
7 D 3 0
8 D 4 1
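Comparing against max() flags every row that ties for the maximum. If exactly one row per group should be flagged, a which.max()-based sketch is possible, since which.max() returns the position of the first maximum. This is an alternative under that assumption, not a correction; the name res is illustrative.

```r
library(dplyr)

df <- data.frame(
  Group   = c("A","A","B","B","B","C","D","D"),
  Utility = c(12, 10, 3, 5, 6, 1, 3, 4)
)

res <- df %>%
  group_by(Group) %>%
  # Flag only the first row that attains the group maximum
  mutate(Choice = as.integer(row_number() == which.max(Utility))) %>%
  ungroup()
res
```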
A one-liner base R solution.
df$Choice <- with(df, ave(Utility, Group, FUN = function(x) +(x == max(x))))
df
# Group Utility Choice
#1 A 12 1
#2 A 10 0
#3 B 3 0
#4 B 5 0
#5 B 6 1
#6 C 1 1
#7 D 3 0
#8 D 4 1
An option with data.table
library(data.table)
# := adds Choice to df by reference instead of returning a separate summary
setDT(df)[, Choice := +(Utility == max(Utility)), by = Group]

Aggregate rows with specific shared value

I want to aggregate my data as follows:
Aggregate only for successive rows where status = 0
Keep age and sum up points
Example data:
da <- data.frame(userid = c(1,1,1,1,2,2,2,2), status = c(0,0,0,1,1,1,0,0), age = c(10,10,10,11,15,16,16,16), points = c(2,2,2,6,3,5,5,5))
da
userid status age points
1 1 0 10 2
2 1 0 10 2
3 1 0 10 2
4 1 1 11 6
5 2 1 15 3
6 2 1 16 5
7 2 0 16 5
8 2 0 16 5
I would like to have:
da2
userid status age points
1 1 0 10 6
2 1 1 11 6
3 2 1 15 3
4 2 1 16 5
5 2 0 16 10
library(dplyr)
da %>%
mutate(grp = with(rle(status),
rep(seq_along(values), lengths)) + cumsum(status != 0)) %>%
group_by_at(vars(-points)) %>%
summarise(points = sum(points)) %>%
ungroup() %>%
select(-grp)
## A tibble: 5 x 4
# userid status age points
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 10 6
#2 1 1 11 6
#3 2 0 16 10
#4 2 1 15 3
#5 2 1 16 5
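The grp helper column is what keeps rows with a non-zero status apart while letting consecutive zero rows share a group. A short sketch of just that step on the status vector from the question (run_id and grp are illustrative names):

```r
status <- c(0,0,0,1,1,1,0,0)

# Run identifier: 1 1 1 2 2 2 3 3
run_id <- with(rle(status), rep(seq_along(values), lengths))

# Adding a running count of non-zero rows gives every status != 0 row
# its own id, while rows in the same run of zeros keep a shared id.
grp <- run_id + cumsum(status != 0)
grp
```

In the full pipeline the grouping also includes userid, status, and age, so a run of zeros is only collapsed when those columns agree as well.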
You can use group_by from dplyr:
da %>% group_by(da$userid, cumsum(da$status), da$status) %>%
summarise(age=max(age), points=sum(points))
Output:
`da$userid` `cumsum(da$status)` `da$status` age points
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 10 6
2 1 1 1 11 6
3 2 2 1 15 3
4 2 3 0 16 10
5 2 3 1 16 5
Exactly the same idea as above:
library(dplyr)
# Rows with status == 0, summed per userid/age/status
da_zero <- da %>% group_by(userid, age, status) %>%
filter(status == 0) %>%
summarise(points = sum(points))
# Rows with status != 0, summed per userid/age/status
da_nonzero <- da %>%
group_by(userid, age, status) %>%
filter(status != 0) %>%
summarise(points = sum(points))
da2 <- bind_rows(da_zero,
da_nonzero)
We need to be more careful with your specification of status equal to 0. I think the code of Quang Hoang works only for your specific example.
I hope this helps.

Resources