Summarize output of table to simple columns - r

I currently have this table and I want to count the total number of purchases for each ID.
Input:
id  purchases  time
a   need       1:00
a   want       1:30
a   none       2:00
b   need       1:15
b   want       1:30
c   none       1:10
c   none       1:30
d   none       2:00
d   need       2:10
d   want       2:15
d   none       2:35
e   none       3:10
e   none       3:50
f   need       2:55
f   want       3:15
f   need       3:20
The purchases column did not exist originally; the data held item names instead. I created this column first and then tried to reach the output below.
Desired output: the total number of items bought, the number of needs and wants separately, and an output column that is "yes" if the first purchase is a need, "no" if it isn't, and "none" if there were no purchases.
id  total  need  want  output
a   2      1     1     yes
b   2      1     1     yes
c   0      0     0     none
d   2      1     1     no
e   0      0     0     none
f   3      2     1     yes
I am using dplyr so I would appreciate the suggested code to be feasible for adding in a dplyr pipeline.
What I tried to do
actions %>% group_by (id) %>% arrange(id) %>%
mutate(purchases = ifelse(type == "Buy" & obj_category == "Books" | type == "Buy" & obj_category == "Car" | type=="Buy" & obj_category == "Business" | type == "Buy", "need",
ifelse(type == "Buy" & obj_category == "Sweets" | type == "Buy" & obj_category == "Electronics" | type == "Buy" & obj_category == "Business" | type == "Buy" & obj_category == "House", "want", "none"))) %>%
summarise(need = ifelse(purchases == "need", 1, 0),
want = ifelse(purchases == "want", 1, 0))
thank you in advance

You could try
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(need = sum(purchases == "need"),
            want = sum(purchases == "want"),
            total = need + want,
            output = case_when(first(purchases) == "need" ~ "yes",
                               total == 0 ~ "none",
                               TRUE ~ "no"))
# # A tibble: 6 × 5
# id need want total output
# <chr> <int> <int> <int> <chr>
# 1 a 1 1 2 yes
# 2 b 1 1 2 yes
# 3 c 0 0 0 none
# 4 d 1 1 2 no
# 5 e 0 0 0 none
# 6 f 2 1 3 yes
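The summary above assumes the purchases column already holds need, want or none. In the question's attempt, the first ifelse() condition ends with | type == "Buy", so every purchase is labelled need. Below is a minimal sketch of one way to build the column with case_when() and %in%. The actions rows and the two category vectors are illustrative assumptions, and Business is treated as a need only because the question lists it under both labels.

```r
library(dplyr)

# Hypothetical raw log: column names follow the question, rows are made up
actions <- data.frame(
  id = c("a", "a", "b", "b"),
  type = c("Buy", "View", "Buy", "Buy"),
  obj_category = c("Books", "Car", "Sweets", "Toys")
)

# Assumed category lists, taken from the conditions in the question
need_cats <- c("Books", "Car", "Business")
want_cats <- c("Sweets", "Electronics", "House")

classified <- actions %>%
  mutate(purchases = case_when(
    type == "Buy" & obj_category %in% need_cats ~ "need",
    type == "Buy" & obj_category %in% want_cats ~ "want",
    TRUE ~ "none"  # not a purchase, or a category in neither list
  ))

classified$purchases
```

The result can be piped straight into the group_by(id) %>% summarise(...) step shown above.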
A general version if there are more categories in purchases:
library(dplyr)
library(janitor)
df %>%
  tabyl(id, purchases) %>%
  select(-none) %>%
  adorn_totals("col") %>%
  left_join(
    df %>% group_by(id) %>%
      summarise(output = case_when(purchases[1] == "need" ~ "yes",
                                   all(purchases == "none") ~ "none",
                                   TRUE ~ "no")))
Data
df <- structure(list(id = c("a", "a", "a", "b", "b", "c", "c", "d",
"d", "d", "d", "e", "e", "f", "f", "f"), purchases = c("need",
"want", "none", "need", "want", "none", "none", "none", "need",
"want", "none", "none", "none", "need", "want", "need"), time = c("1:00",
"1:30", "2:00", "1:15", "1:30", "1:10", "1:30", "2:00", "2:10",
"2:15", "2:35", "3:10", "3:50", "2:55", "3:15", "3:20")), class = "data.frame", row.names = c(NA, -16L))

Here's a solution with dplyr and janitor:
library(dplyr)
library(janitor)
df %>%
  janitor::tabyl(id, purchases) %>%
  left_join(df %>% group_by(id) %>% slice(1), by = "id") %>%
  rowwise() %>%
  # total counts only need and want; need:want would also pick up the none column
  mutate(total = sum(c_across(c(need, want)))) %>%
  ungroup() %>%
  mutate(purchases = ifelse(purchases == "need", "yes", "no"),
         purchases = ifelse(total == 0, "none", purchases)) %>%
  select(-c(time, total))
Which gives:
# A tibble: 6 × 5
  id     need  none  want purchases
  <chr> <dbl> <dbl> <dbl> <chr>
1 a         1     1     1 yes
2 b         1     0     1 yes
3 c         0     2     0 none
4 d         1     2     1 no
5 e         0     2     0 none
6 f         2     0     1 yes

Related

Identify rows with a value greater than threshold, but only direct one above per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is constant within each group. I want to flag a value that is greater than the threshold, but only the one directly above it in each group.
test <- data.frame(
grp = c("A", "A", "A", "B", "B", "B"),
value = c(1, 3, 5, 1, 3, 5),
threshold = c(4,4,4,2,2,2)
)
want <- data.frame(
grp = c("A", "A", "A", "B", "B", "B"),
value = c(1, 3, 5, 1, 3, 5),
threshold = c(4,4,4,2,2,2),
want = c(NA, NA, "yes", NA, "yes", NA)
)
In the table above, Group A has a threshold of 4, and only the value 5 is higher. In Group B the threshold is 2, and both 3 and 5 are higher, but only the row with value 3 is marked.
I was able to do this by identifying which rows had a value greater than the threshold, then removing the repeated flags:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
mutate(across(want, ~replace(.x, duplicated(.x), NA)))
I was wondering whether there is a direct way to do this with a single logical statement rather than in two steps, something along the lines of:
test %>%
group_by(grp) %>%
mutate(want = if_else(???, "yes", NA_character_))
The answer doesn't have to be in R either. An explanation of the logical steps would suffice. Perhaps using a rank?
Thank you!
library(dplyr)
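test %>%
  group_by(grp) %>%
  # default = FALSE keeps the first row of a group from becoming NA when it already exceeds the threshold
  mutate(want = (value > threshold), want = want & !lag(cumany(want), default = FALSE)) %>%
  ungroup()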
# # A tibble: 6 × 4
# grp value threshold want
# <chr> <dbl> <dbl> <lgl>
# 1 A 1 4 FALSE
# 2 A 3 4 FALSE
# 3 A 5 4 TRUE
# 4 B 1 2 FALSE
# 5 B 3 2 TRUE
# 6 B 5 2 FALSE
If you really want strings, you can if_else after this.
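As a sketch of that last step, the logical flag can be wrapped in if_else() to produce the "yes"/NA strings from the question. The helper column name above is an assumption made for readability, and default = FALSE in lag() is there so that a group whose first row already exceeds the threshold is still flagged.

```r
library(dplyr)

test <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4, 4, 4, 2, 2, 2)
)

result <- test %>%
  group_by(grp) %>%
  mutate(
    above = value > threshold,  # hypothetical helper column
    # TRUE only for the first row in the group that exceeds the threshold
    want = if_else(above & !lag(cumany(above), default = FALSE),
                   "yes", NA_character_)
  ) %>%
  ungroup() %>%
  select(-above)

result$want
```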
Here is a more direct way.
The essential part:
min(which(value > threshold)) gives the position of the first TRUE within each group.
We then compare row_number() with that position inside ifelse() and set the result accordingly:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = ifelse(row_number()==min(which((value > threshold) == TRUE)),
"yes", NA_character_))
grp value threshold want
<chr> <dbl> <dbl> <chr>
1 A 1 4 NA
2 A 3 4 NA
3 A 5 4 yes
4 B 1 2 NA
5 B 3 2 yes
6 B 5 2 NA
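A related single-expression variant, offered as a sketch rather than a drop-in replacement: a row is the first one above the threshold exactly when it exceeds the threshold and the running count of such rows equals 1. Unlike min(which(...)), this needs no special handling for a group in which nothing exceeds the threshold. Group C below is a made-up addition to show that case.

```r
library(dplyr)

test <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B", "C", "C"),
  value = c(1, 3, 5, 1, 3, 5, 1, 2),
  threshold = c(4, 4, 4, 2, 2, 2, 9, 9)  # group C never exceeds its threshold
)

flagged <- test %>%
  group_by(grp) %>%
  # cumsum() counts how many rows so far exceed the threshold within the group
  mutate(want = if_else(value > threshold & cumsum(value > threshold) == 1,
                        "yes", NA_character_)) %>%
  ungroup()

flagged$want
```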
This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:
library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
# grp value threshold flag
# <char> <num> <num> <lgcl>
#1: A 1 4 NA
#2: A 3 4 NA
#3: A 5 4 TRUE
#4: B 1 2 NA
#5: B 3 2 TRUE
#6: B 5 2 NA
This finds the "first" matching value in each group that is greater than (>) the threshold and sets (:=) its flag to TRUE.

case_when when there are factors

I am trying to combine treatment allocations for patients who completed two different randomisation forms. I can simulate some example data here:
data <- data.frame(id = 1:100,
trt_a = factor(c(sample(0:1, 50, TRUE), rep(NA, 50))),
trt_b = factor(c(sample(0:1, 50, TRUE), rep(NA, 50))),
trt_ab = factor(c(rep(NA, 50), sample(c("a", "b", "ab", "neither"), 50, TRUE))))
Is there any way of creating a new column with the same factor levels as trt_ab? Half the patients had choice of either trt_a or trt_b, and the other half had choice trt_ab. I want to use some sort of case_when statement to generate a new column with the actual treatment choices:
data %>%
mutate(trt = case_when(trt_a == 0 & trt_b == 0 ~ "neither",
trt_a == 1 & trt_b == 0 ~ "a",
trt_a == 0 & trt_b == 1 ~ "b",
trt_a == 1 & trt_b == 1 ~ "ab",
!is.na(trt_ab) ~ trt_ab))
However, when any of the columns are factors, I get the following error:
Error in `mutate()`:
! Problem while computing `trt = case_when(...)`.
Caused by error in `` names(message) <- `*vtmp*` ``:
! 'names' attribute [1] must be the same length as the vector [0]
data %>%
mutate(trt = case_when(trt_a == 0 & trt_b == 0 ~ "neither",
trt_a == 1 & trt_b == 0 ~ "a",
trt_a == 0 & trt_b == 1 ~ "b",
trt_a == 1 & trt_b == 1 ~ "ab",
!is.na(trt_ab) ~ trt_ab)) %>% head
-output
id trt_a trt_b trt_ab trt
1 1 0 0 <NA> neither
2 2 0 0 <NA> neither
3 3 1 1 <NA> ab
4 4 1 1 <NA> ab
5 5 0 1 <NA> b
6 6 1 1 <NA> ab
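Whether the call above errors or returns a character column appears to depend on the dplyr version, because the right-hand sides mix character strings with a factor. A minimal sketch that sidesteps the type mismatch and also gives the new column the same levels as trt_ab: convert the factor with as.character() inside case_when(), then rebuild the factor afterwards. The six-row data frame is a made-up, deterministic stand-in for the simulated data.

```r
library(dplyr)

# Small deterministic stand-in for the simulated data in the question
data <- data.frame(
  id = 1:6,
  trt_a = factor(c(0, 1, 0, 1, NA, NA)),
  trt_b = factor(c(0, 0, 1, 1, NA, NA)),
  trt_ab = factor(c(NA, NA, NA, NA, "ab", "neither"),
                  levels = c("a", "b", "ab", "neither"))
)

result <- data %>%
  mutate(
    trt = case_when(
      trt_a == 0 & trt_b == 0 ~ "neither",
      trt_a == 1 & trt_b == 0 ~ "a",
      trt_a == 0 & trt_b == 1 ~ "b",
      trt_a == 1 & trt_b == 1 ~ "ab",
      !is.na(trt_ab) ~ as.character(trt_ab)  # every branch is now character
    ),
    trt = factor(trt, levels = levels(trt_ab))  # reuse the levels of trt_ab
  )

result$trt
```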

Complete and fill data.frame with multiple conditions

I want to complete a data.frame with all combinations of two variables but with two conditions.
Here is my data.frame:
Data <-
data.frame(
A = rep(c("a", "b", "c", "d"), each = 2),
N = 1,
Type = c("i", "i", "I", "i", "i", "i", "I", "I")
)
> Data
A N Type
1 a 1 i
2 a 1 i
3 b 1 I
4 b 1 i
5 c 1 i
6 c 1 i
7 d 1 I
8 d 1 I
Now I want to complete that data.frame with all combinations of A and Type, but only where A != "a" and Type == "I". So only one additional row is needed, the row with A == "c" and Type == "I". Furthermore, N should be filled with 0; see my desired output below. Is there an elegant way to achieve this? It would be great with tidyr's complete, but it's OK if not. It would be even better if the row order matched the output below.
> Data
A N Type
1 a 1 i
2 a 1 i
3 b 1 I
4 b 1 i
5 c 1 i
6 c 1 i
7 c 0 I
8 d 1 I
9 d 1 I
Perhaps you can try this:
library(dplyr)
library(tidyr)
Data %>%
mutate(Type = factor(Type)) %>%
filter(!A %in% A[Type == "I"], A != 'a') %>%
complete(A, Type, fill = list(N = 0)) %>%
bind_rows(Data %>% filter(A %in% A[Type == "I"] | A == 'a')) %>%
arrange(A, -N)
# A Type N
# <chr> <chr> <dbl>
#1 a i 1
#2 a i 1
#3 b I 1
#4 b i 1
#5 c i 1
#6 c i 1
#7 c I 0
#8 d I 1
#9 d I 1
This filters for the A values you are interested in (filter(!A %in% A[Type == "I"], A != 'a')), completes those A values, binds them to the remaining rows, and arranges the result.
You could do something like this
library(tidyverse)
Data <-
data.frame(
A = rep(c("a", "b", "c", "d"), each = 2),
N = 1,
Type = c("i", "i", "I", "i", "i", "i", "I", "I")
)
Data %>%
bind_rows(
complete(Data, A, nesting(Type)) %>%
filter(A != "a" & Type == "I" & is.na(N))
) %>%
mutate(N = replace_na(N, 0)) %>%
arrange(A, -N)
I filter on is.na(N) to ensure that only the "new" rows are added.
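Another way to express the same idea, sketched here under the assumption that the rule is exactly "A != "a" and Type == "I"": build the candidate combinations with tidyr's crossing(), keep those that satisfy the rule, drop the ones already present with anti_join(), and bind the remainder to the original data with N = 0. The name missing_rows is only illustrative.

```r
library(dplyr)
library(tidyr)

Data <- data.frame(
  A = rep(c("a", "b", "c", "d"), each = 2),
  N = 1,
  Type = c("i", "i", "I", "i", "i", "i", "I", "I")
)

# Combinations allowed by the rule that do not yet exist in Data
missing_rows <- crossing(A = unique(Data$A), Type = unique(Data$Type)) %>%
  filter(A != "a", Type == "I") %>%
  anti_join(Data, by = c("A", "Type")) %>%
  mutate(N = 0)

result <- bind_rows(Data, missing_rows) %>%
  arrange(A, desc(N))  # new rows (N = 0) sort after existing rows within each A

result
```

Keeping the rule in a single filter() call makes it easy to change later without touching the rest of the pipeline.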

How to count the frequency of unique factor across each row in r dataframe

I have a dataset like the following:
Age   Monday  Tuesday  Wednesday
6-9   a       b        a
6-9   b       b        c
6-9           c        a
9-10  c       c        b
9-10  c       a        b
Using R, I want to get the following result, where each column holds the number of times that unique value occurs in the row:
Age a b c
6-9 2 1 0
6-9 0 2 1
6-9 1 0 1
9-10 0 1 2
9-10 1 1 1
Note: My data also contains missing values
Here are a couple of quick and dirty tidyverse solutions. There should be a way to reduce the number of steps, though.
library(tidyverse) # install.packages("tidyverse")
input <- tribble(
~Age, ~Monday, ~Tuesday, ~Wednesday,
"6-9", "a", "b", "a",
"6-9", "b", "b", "c",
"6-9", "", "c", "a",
"9-10", "c", "c", "b",
"9-10", "c", "a", "b"
)
# pivot solution
input %>%
rowid_to_column() %>%
mutate_all(function(x) na_if(x, "")) %>%
pivot_longer(cols = -c(rowid, Age), values_drop_na = TRUE) %>%
count(rowid, Age, value) %>%
pivot_wider(id_cols = c(rowid, Age), names_from = value, values_from = n, values_fill = list(n = 0)) %>%
select(-rowid)
# manual solution (if only a, b, c are expected as options)
input %>%
unite(col = "combine", Monday, Tuesday, Wednesday, sep = "") %>%
transmute(
Age,
a = str_count(combine, "a"),
b = str_count(combine, "b"),
c = str_count(combine, "c")
)
In base R, we can replace empty values with NA, get unique values in the dataframe, and use apply row-wise and count the occurrence of values using table.
df[df == ''] <- NA
vals <- unique(na.omit(unlist(df[-1])))
cbind(df[1], t(apply(df, 1, function(x) table(factor(x, levels = vals)))))
# Age a b c
#1 6-9 2 1 0
#2 6-9 0 2 1
#3 6-9 1 0 1
#4 9-10 0 1 2
#5 9-10 1 1 1
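A vectorised base R variant of the same idea, as a sketch: for each unique value, compare the whole block of day columns against it and take rowSums(), which avoids calling table() once per row. The data frame below re-creates the question's data with the missing entry stored as NA; the names day_cols and counts are illustrative.

```r
df <- data.frame(
  Age = c("6-9", "6-9", "6-9", "9-10", "9-10"),
  Monday = c("a", "b", NA, "c", "c"),
  Tuesday = c("b", "b", "c", "c", "a"),
  Wednesday = c("a", "c", "a", "b", "b"),
  stringsAsFactors = FALSE
)

day_cols <- df[-1]
vals <- sort(unique(na.omit(unlist(day_cols))))

# One column per unique value: number of matching cells in each row, ignoring NA
counts <- sapply(vals, function(v) rowSums(day_cols == v, na.rm = TRUE))

result <- cbind(df[1], counts)
result
```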

How do you exclude values when creating a string when setting up initial conditions?

I'm trying to combine columns in my data frame into a single string. The columns "C", "H", "O", "N", and "S" are elements, and each holds the number of atoms of that element in the molecule. I want to leave out any element whose count is 0. For example, when a molecule has no oxygen, the O value is 0 and should not appear in the string.
#This is a portion of my data frame titled data4a
C H O N S
3 4 0 0 1
7 5 4 1 0
#The code I have is
data4a$NewComp = paste("C",data4a$Total.C,"H", data4a$NewH, "O", data4a$O, "N", data4a$N, "S", data4a$S, sep = "")
#This code gives me this
C H O N S NewComp
3 4 0 0 1 C3H4O0N0S1
7 5 4 1 0 C7H5O4N1S0
#I expect to see something like this when I print my results
C H O N S NewComp
3 4 0 0 1 C3H4S1
7 5 4 1 0 C7H5O4N
#I want values of zero to be excluded from the string created
An option is apply with argument MARGIN = 1
dat$NewComp <- apply(dat, 1, function(x) {
tmp <- unlist(x)
paste0(names(x)[tmp != 0], tmp[tmp != 0], collapse = "")
})
Result
dat
# C H O N S NewComp
#1 3 4 0 0 1 C3H4S1
#2 7 5 4 1 0 C7H5O4N1
data
dat <- structure(list(C = c(3L, 7L), H = 4:5, O = c(0L, 4L), N = 0:1,
S = c(1L, 0L)), .Names = c("C", "H", "O", "N", "S"), class = "data.frame", row.names = c(NA,
-2L))
Here is a base R solution that solves the problem in the question and simplifies building the formula strings at the same time.
m <- matrix(paste0(names(data4a), t(as.matrix(data4a))),
            ncol = ncol(data4a), byrow = TRUE)
m <- apply(m, 1, paste, collapse = "")
# Drop element symbols whose count is exactly 0; the lookahead leaves counts such as 10 intact
data4a$NewComp <- gsub("[A-Za-z]+0(?![0-9])", "", m, perl = TRUE)
data4a
# C H O N S NewComp
#1 3 4 0 0 1 C3H4S1
#2 7 5 4 1 0 C7H5O4N1
Data.
data4a <- read.table(text = "
C H O N S
3 4 0 0 1
7 5 4 1 0
", header = TRUE)
Another approach is to use which() to create a new data frame holding the row number, column number and value of every entry that is not 0. We then replace the column numbers with column names and use aggregate() by row number to paste each formula together.
df1 <- which(df != 0, arr.ind = TRUE)
df2 <- cbind.data.frame(df1, value = df[df != 0])
df2$col <- names(df)[df2$col]
df$NewComp <- aggregate(paste0(df2$col, df2$value), list(df2$row),
paste0, collapse = "")[, 2]
df
# C H O N S NewComp
#1 3 4 0 0 1 C3H4S1
#2 7 5 4 1 0 C7H5O4N1
As mentioned in the comments on another answer, if the counts are only in selected columns, use df[selected_columns] in the first statement with which.
One possibility involving tidyverse could be:
df %>%
rowid_to_column() %>%
gather(var, val, -rowid) %>%
filter(val != 0) %>%
group_by(rowid) %>%
summarise(NewComp = paste0(paste0(var, val), collapse = "")) %>%
left_join(df %>%
rowid_to_column(), by = c("rowid" = "rowid")) %>%
ungroup() %>%
select(-rowid)
NewComp C H O N S
<chr> <int> <int> <int> <int> <int>
1 C3H4S1 3 4 0 0 1
2 C7H5O4N1 7 5 4 1 0
Or:
df %>%
rowid_to_column() %>%
gather(var, val, -rowid) %>%
filter(val != 0) %>%
group_by(rowid) %>%
mutate(NewComp = paste0(paste0(var, val), collapse = "")) %>%
spread(var, val, fill = 0) %>%
ungroup() %>%
select(-rowid)
Sample data:
df <- read.table(text = "C H O N S
3 4 0 0 1
7 5 4 1 0",
header = TRUE,
stringsAsFactors = FALSE)
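For comparison, a short base R sketch that works column by column instead of row by row: each element column becomes either an empty string (count 0) or symbol plus count, and the pieces are pasted together. It uses no regular expression, so multi-digit counts such as 10 or 20 are left alone. The third row is a made-up addition to show that case, and the name parts is illustrative.

```r
df <- data.frame(
  C = c(3, 7, 10),
  H = c(4, 5, 20),
  O = c(0, 4, 0),
  N = c(0, 1, 0),
  S = c(1, 0, 0)
)

# For every element column: "" when the count is 0, otherwise symbol + count
parts <- Map(function(el, n) ifelse(n == 0, "", paste0(el, n)), names(df), df)

# Paste the per-element pieces together row by row
df$NewComp <- do.call(paste0, unname(parts))
df
```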
