Expand a table of counts to a dataframe in R

Given a table of counts specified in 'dat' below, I would like to create a dataframe with 3 columns (race, grp and outcome) and 206 rows. The variable outcome should be 1 for 'ascertained' counts and 0 for 'missed' counts.
dat <- structure(list(race = structure(c(1L, 2L, 1L, 2L), levels = c("black",
"nonblack"), class = "factor"), grp = structure(c(1L, 1L, 2L,
2L), levels = c("hbpm", "uc"), class = "factor"), ascertained = c(63,
32, 24, 21), missed = c(5, 3, 49, 9), total = c(68, 35, 73, 30
)), class = "data.frame", row.names = c(NA, -4L))

1) For each row, set race in the output to that row's race and grp to that row's group, then generate the appropriate number of 1s and 0s for outcome. The result is 206 x 3.
library(dplyr)
dat %>%
  rowwise() %>%
  summarize(race = race, grp = grp, outcome = rep(1:0, c(ascertained, missed)))
2) In the example data there are no duplicate race/grp combinations; if that is true in general, it can alternately be written as:
dat %>%
  group_by(race, grp) %>%
  summarize(outcome = rep(1:0, c(ascertained, missed)), .groups = "drop")
3) A base R solution would be the following. If each combination of race/grp occurs on only one row of the input then 1:nrow(dat) could optionally be replaced with dat[1:2].
do.call("rbind",
by(dat,
1:nrow(dat),
with,
data.frame(race = race, grp = grp, outcome = rep(1:0, c(ascertained, missed)))
)
)
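As a quick sanity check (a sketch that works with any of the results above), the expanded data should have 206 rows, i.e. the sum of the total column, and should reproduce the original counts:
res <- dat %>%
  group_by(race, grp) %>%
  summarize(outcome = rep(1:0, c(ascertained, missed)), .groups = "drop")
nrow(res)                       # 206
count(res, race, grp, outcome)  # matches the ascertained/missed counts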

How about this:
library(tidyverse)
dat <- structure(list(race = structure(c(1L, 2L, 1L, 2L), levels = c("black",
"nonblack"), class = "factor"), grp = structure(c(1L, 1L, 2L,
2L), levels = c("hbpm", "uc"), class = "factor"), ascertained = c(63,
32, 24, 21), missed = c(5, 3, 49, 9), total = c(68, 35, 73, 30
)), class = "data.frame", row.names = c(NA, -4L))
dat2 <- dat %>%
  select(-total) %>%
  pivot_longer(c(ascertained, missed), names_to = "var", values_to = "vals") %>%
  uncount(vals) %>%
  mutate(outcome = case_when(var == "ascertained" ~ 1,
                             TRUE ~ 0)) %>%
  select(-var)
head(dat2)
#> # A tibble: 6 × 3
#> race grp outcome
#> <fct> <fct> <dbl>
#> 1 black hbpm 1
#> 2 black hbpm 1
#> 3 black hbpm 1
#> 4 black hbpm 1
#> 5 black hbpm 1
#> 6 black hbpm 1
dat2 %>%
  group_by(race, grp, outcome) %>%
  tally()
#> # A tibble: 8 × 4
#> # Groups: race, grp [4]
#> race grp outcome n
#> <fct> <fct> <dbl> <int>
#> 1 black hbpm 0 5
#> 2 black hbpm 1 63
#> 3 black uc 0 49
#> 4 black uc 1 24
#> 5 nonblack hbpm 0 3
#> 6 nonblack hbpm 1 32
#> 7 nonblack uc 0 9
#> 8 nonblack uc 1 21

This is based partially on the linked question from Limey in the comments:
library(tidyverse)
bind_rows(
  dat %>% uncount(ascertained) %>% mutate(outcome = 1) %>% select(-missed, -total),
  dat %>% uncount(missed) %>% mutate(outcome = 0) %>% select(-ascertained, -total)
)

Here is a relatively simple answer that is based, in part, on the answer suggested in a comment, but adapted to work for your problem, since you need multiple "uncounts". This answer uses functions from the packages tibble, dplyr, and tidyr, all of which are in the tidyverse.
The exact method is to create two intermediate data frames, one expanding the "ascertained" counts and one expanding the "missed" counts, recoding the ascertained column into the 0/1 outcome you wanted, and then stacking the two together with tibble::add_row.
The relevant code is:
library(tidyverse)
dat2 <- uncount(dat, ascertained, .remove = F) %>%
  mutate(ascertained = 1) %>%
  select(-missed)
dat3 <- uncount(dat, missed, .remove = T) %>%
  mutate(ascertained = 0)
dat4 <- add_row(dat2, dat3) %>%
  select(-total) %>%
  rename(outcome = ascertained)
dat4 should be the data as you asked for it. I would suggest also generating an id column to make things easier to work with, but obviously that is up to you.
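For example, a minimal sketch of that optional id column (the name id is just illustrative):
dat4 <- dat4 %>% mutate(id = row_number())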

Related

Coalescing multiple columns from both the left and right side

Given the following data
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
I'm trying to coalesce all columns sharing the same initial prefix name (i.e. coalesce alpha_1, alpha_2, alpha_3, alpha_4; coalesce beta_1, beta_2; etc.), but from both the left and right sides. That is, I want to generate two new variables, say 'alpha_left' and 'alpha_right', whose values would be, in this example, (2, 2, 3) and (3, 4, 2) respectively (the first non-missing elements from the left and from the right side of the dataframe).
User #akrun offered a great solution for the coalescing part here, but I'm unsure how to create two new variables from both the left and right coalesces.
Here is an option in tidyverse:
- Reshape to 'long' format with pivot_longer
- Group by 'ID'
- summarise across the columns 'alpha' through 'charlie'
- Get the current column name with cur_column()
- Create a tibble with the first non-NA element from the left and from the right
- Rename that tibble's columns by appending 'nm1' as a prefix
- Finally, unnest the list columns created in summarise
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  pivot_longer(cols = contains("_"),
               names_to = c(".value", "grp"), names_sep = "_") %>%
  group_by(ID) %>%
  summarise(across(alpha:charlie, ~ {
    nm1 <- cur_column()
    tbl1 <- tibble(left = .[complete.cases(.)][1],
                   right = rev(.)[complete.cases(rev(.))][1])
    names(tbl1) <- str_c(nm1, "_", names(tbl1))
    list(tbl1)
  })) %>%
  unnest(c(alpha, beta, charlie))
Output:
# A tibble: 3 x 7
ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
<int> <int> <int> <int> <int> <int> <int>
1 1 2 3 3 3 1 1
2 2 2 4 2 2 2 2
3 3 3 2 2 2 1 1
Or using base R
lst1 <- lapply(split.default(df1[-1], sub("_\\d+$", "", names(df1)[-1])),
  function(x) {
    x1 <- apply(x, 1, function(y) {
      y1 <- na.omit(y)
      if (length(y1) > 1) y1[c(1, length(y1))] else y1[1]
    })
    if (is.vector(x1)) as.data.frame(matrix(x1)) else as.data.frame(t(x1))
  })
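lst1 is a named list with one small data frame per prefix. A hedged sketch for binding it back into a single data frame is below; the assembled column names come out in cbind's name.V1/name.V2 style and could be renamed to _left/_right. Note that for this particular data the beta and charlie groups never have more than one non-NA value per row, so left and right coincide and only one column is returned for each of them:
# a hypothetical assembly step; names come out as e.g. alpha.V1, alpha.V2
out <- cbind(df1["ID"], do.call(cbind, lst1))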
You could also do:
library(purrr)
df1[-1] %>%
  split.default(sub("_\\d+", "", names(.))) %>%
  imap_dfc(~ data.frame(left = coalesce(!!!.x),
                        right = coalesce(!!!rev(.x))) %>%
             set_names(paste(.y, names(.), sep = "_")))
  alpha_left alpha_right beta_left beta_right charlie_left charlie_right
1          2           3         3          3            1             1
2          2           4         2          2            2             2
3          3           2         2          2            1             1
One more approach, not as elegant as @Onyambu's:
library(tidyverse)
df1[-1] %>%
  split.default(sub("_\\d+", "", names(.))) %>%
  imap_dfc(~ .x %>%
             rowwise() %>%
             mutate(!!paste0(.y, '_left') := head(na.omit(c_across(everything())), 1),
                    !!paste0(.y, '_right') := tail(na.omit(c_across(!last_col())), 1),
                    .keep = 'none'))
#> # A tibble: 3 x 6
#> # Rowwise:
#> alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int>
#> 1 2 3 3 3 1 1
#> 2 2 4 2 2 2 2
#> 3 3 2 2 2 1 1
Created on 2021-06-19 by the reprex package (v2.0.0)
Another option
library(tidyverse)
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
df1 %>%
  pivot_longer(cols = -ID, names_sep = "_", names_to = c(".value", "set")) %>%
  group_by(ID) %>%
  fill(alpha:charlie, .direction = "updown") %>%
  filter(set %in% range(set)) %>%
  mutate(set = c("left", "right")) %>%
  pivot_wider(id_cols = ID, names_from = set, values_from = alpha:charlie)
#> # A tibble: 3 x 7
#> # Groups: ID [3]
#> ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 2 3 3 3 1 1
#> 2 2 2 4 2 2 2 2
#> 3 3 3 2 2 2 1 1
Created on 2021-06-20 by the reprex package (v2.0.0)

group categories according to whether they meet a sequence of conditions with tidyverse

I have a problem with how to recategorize a variable according to whether it meets a certain condition or not. That is, if a category does not meet the criteria, it should be combined with another category that does.
My data has the following form:
data = data.frame(firm_size = c("Micro", "Small", "Medium","Big"),
employees = c(5,10,100,1000))
> data
firm_size employees
1 Micro 5
2 Small 10
3 Medium 100
4 Big 1000
So, if my condition is that I must group the companies that have 10 or fewer employees, combining them into a single category, the result should be:
> new_data
firm_size employees
1 Micro-Small 15
3 Medium 100
4 Big 1000
What I'm trying to do is write a function that generalizes this procedure, for example, that also works if my data is
> data
firm_size employees
1 Micro 5
2 Small 8
3 Medium 9
4 Big 1000
> new_data
firm_size employees
1 Micro-Small-Medium 22
4 Big 1000
I think that this can be done with the tools of the tidyverse.
Thanks in advance
Here's an approach with tally:
library(dplyr)
size <- 10
data %>%
  arrange(firm_size, desc(employees)) %>%
  group_by(firm_size = c(as.character(firm_size[employees > size]),
                         rep(paste(firm_size[employees <= size], collapse = "-"),
                             sum(employees <= size)))) %>%
  tally(employees, name = "employees")
## A tibble: 3 x 2
# firm_size employees
# <chr> <dbl>
#1 Big 1000
#2 Medium 100
#3 Small-Micro 15
And for your second set of data:
data2 %>%
  arrange(firm_size, desc(employees)) %>%
  group_by(firm_size = c(as.character(firm_size[employees > size]),
                         rep(paste(firm_size[employees <= size], collapse = "-"),
                             sum(employees <= size)))) %>%
  tally(employees, name = "employees")
## A tibble: 2 x 2
# firm_size employees
# <chr> <int>
#1 Big 1000
#2 Medium-Small-Micro 22
Data
data <- structure(list(firm_size = structure(c(3L, 4L, 2L, 1L), .Label = c("Big",
"Medium", "Micro", "Small"), class = "factor"), employees = c(5,
10, 100, 1000)), class = "data.frame", row.names = c(NA, -4L))
data2 <- structure(list(firm_size = structure(c(3L, 4L, 2L, 1L), .Label = c("Big",
"Medium", "Micro", "Small"), class = "factor"), employees = c(5L,
8L, 9L, 1000L)), class = "data.frame", row.names = c("1", "2",
"3", "4"))
You can use the great forcats package
library(tidyverse)
data <- data.frame(
firm_size = c("Micro", "Small", "Medium", "Big", "Small"),
employees = c(5, 10, 100, 1000, 10)
)
# If you need n groups
data %>%
  mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump(n = 2, w = employees)) %>%
  group_by(firm_size2) %>%
  summarise(sum_emp = sum(employees), .groups = "drop")
#> # A tibble: 3 x 2
#> firm_size2 sum_emp
#> <fct> <dbl>
#> 1 Medium 100
#> 2 Big 1000
#> 3 Other 25
# If you need at least x on the sum of a vector
data %>%
  mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump_min(min = 10, w = employees)) %>%
  group_by(firm_size2) %>%
  summarise(sum_emp = sum(employees), .groups = "drop")
#> # A tibble: 4 x 2
#> firm_size2 sum_emp
#> <fct> <dbl>
#> 1 Small 20
#> 2 Medium 100
#> 3 Big 1000
#> 4 Other 5
Created on 2020-06-11 by the reprex package (v0.3.0)
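Note that forcats labels the lumped level "Other" rather than the combined category names shown in the question. A hedged sketch for relabelling it afterwards (using min = 30 here so that more than one category gets lumped; the object names d2 and lumped are just illustrative):
d2 <- data %>%
  mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump_min(min = 30, w = employees))
# collect the categories absorbed into "Other" and use them as the new level label
lumped <- unique(as.character(d2$firm_size[d2$firm_size2 == "Other"]))
levels(d2$firm_size2)[levels(d2$firm_size2) == "Other"] <- paste(lumped, collapse = "-")
d2 %>%
  group_by(firm_size2) %>%
  summarise(sum_emp = sum(employees), .groups = "drop")
# with this 5-row data, the lumped row is "Micro-Small" with sum_emp 25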
Yet another solution, set into a custom function:
library(tidyverse)
mymerge <- function(dat, min) {
  merged_dat <- dat %>%
    filter(if_else(employees <= min, TRUE, FALSE)) %>%
    summarize(firm_size = str_flatten(firm_size, collapse = " - "),
              employees = sum(employees))
  dat %>%
    filter(if_else(employees <= min, FALSE, TRUE)) %>%
    bind_rows(merged_dat)
}
mymerge(data, 30)
firm_size employees
1 Medium 100
2 Big 1000
3 Micro - Small 15
mymerge(data, 300)
firm_size employees
1 Big 1000
2 Micro - Small - Medium 115

R Replace NA for all Columns Except *

library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#> # A tibble: 4 x 5
#> Date col1 thisCol thatCol col999
#> <date> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 1 NA 25 99
#> 2 2020-01-01 2 8 26 99
#> 3 2020-01-01 3 NA 27 99
#> 4 NA 4 3 28 99
My actual R data frame has hundreds of columns that aren't neatly named, but can be approximated by the df data frame above.
I want to replace all NA values with 0, with the exception of several columns (in my example I want to leave out the Date column and the thatCol column). I'd want to do it in this sort of fashion:
df %>% replace(is.na(.), 0)
#> Error: Assigned data `values` must be compatible with existing data.
#> i Error occurred for column `Date`.
#> x Can't convert <double> to <date>.
#> Run `rlang::last_error()` to see where the error occurred.
And my unsuccessful ideas for accomplishing the "everything except" replace NA are shown below.
df %>% replace(is.na(c(., -c(Date, thatCol)), 0))
df %>% replace_na(list([, c(2:3, 5)] = 0))
df %>% replace_na(list(everything(-c(Date, thatCol)) = 0))
Is there a way to select everything BUT in the way I need to? There's hundred of columns, named inconsistently, so typing them one by one is not a practical option.
You can use mutate_at:
library(dplyr)
Exclude them by name
df %>% mutate_at(vars(-c(Date, thatCol)), ~replace(., is.na(.), 0))
Exclude them by position
df %>% mutate_at(-c(1,4), ~replace(., is.na(.), 0))
Select them by name
df %>% mutate_at(vars(col1, thisCol, col999), ~replace(., is.na(.), 0))
Select them by position
df %>% mutate_at(c(2, 3, 5), ~replace(., is.na(.), 0))
If you want to use replace_na
df %>% mutate_at(vars(-c(Date, thatCol)), tidyr::replace_na, 0)
Note that mutate_at is soon going to be replaced by across in dplyr 1.0.0.
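For example, with dplyr >= 1.0.0 the first variant above could be written with across() (a hedged sketch):
df %>% mutate(across(-c(Date, thatCol), ~ replace(., is.na(.), 0)))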
You have several options here based on data.table.
One of the coolest options: setnafill (version >= 1.12.4):
library(data.table)
setDT(df)
data.table::setnafill(df, fill = 0, cols = colnames(df)[!(colnames(df) %in% c("Date", "thatCol"))])
Note that your dataframe is updated by reference.
Another base solution:
to_change <- grep("^(this|col)", names(df))
df[to_change] <- sapply(df[to_change], function(x) replace(x, is.na(x), 0))
df
# A tibble: 4 x 5
Date col1 thisCol thatCol col999
<date> <dbl> <dbl> <int> <dbl>
1 2020-01-01 1 0 25 99
2 2020-01-01 2 8 26 99
3 2020-01-01 3 0 27 99
4 NA 0 3 28 99
Data(I changed one value):
df <- structure(list(Date = structure(c(18262, 18262, 18262, NA), class = "Date"),
col1 = c(1L, 2L, 3L, NA), thisCol = c(NA, 8, NA, 3), thatCol = 25:28,
col999 = c(99, 99, 99, 99)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
replace works on a data.frame, so we can just do the replacement by index and update the original dataset
df[-c(1, 4)] <- replace(df[-c(1, 4)], is.na(df[-c(1, 4)]), 0)
Or using replace_na with across (from the new dplyr)
library(dplyr)
library(tidyr)
df %>%
  mutate(across(-c(Date, thatCol), ~ replace_na(., 0)))
If you know the ones that you don't want to change, you could do it like this:
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#dplyr
df_nonreplace <- select(df, c("Date", "thatCol"))
df_replace <- df[ ,!names(df) %in% names(df_nonreplace)]
df_replace[is.na(df_replace)] <- 0
df <- cbind(df_nonreplace, df_replace)
> head(df)
Date thatCol col1 thisCol col999
1 2020-01-01 25 1 0 99
2 2020-01-01 26 2 8 99
3 2020-01-01 27 3 0 99
4 <NA> 28 4 3 99

Creating an interval for a frequency table in R

I have a dataframe I've created in the form
FREQ CNT
0 5
1 20
2 1000
3 3
4 3
I want to further group my results to be in the following form:
CUT CNT
0+1 25
2+3 1003
4+5 ...
.....
I've tried using the between and cut functions in dplyr, but they just add a new interval column to my dataframe. Can anyone give me a good indication as to where to go to achieve this?
Here is a way to do it in dplyr:
library(dplyr)
df <- df %>%
  mutate(id = 1:n()) %>%
  mutate(new_freq = ifelse(id %% 2 != 0,
                           paste0(FREQ, "+", lead(FREQ, 1)),
                           paste0(lag(FREQ, 1), "+", FREQ)))
df <- df %>%
  group_by(new_freq) %>%
  mutate(new_cnt = sum(CNT))
unique(df[, 4:5])
# A tibble: 2 x 2
# Groups: new_freq [2]
# new_freq new_cnt
# <chr> <int>
#1 0+1 25
#2 2+3 1003
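A more compact, hedged sketch of the same idea, using integer division to build the pair labels (this assumes FREQ holds consecutive integers starting at 0):
df %>%
  group_by(CUT = paste0(FREQ %/% 2 * 2, "+", FREQ %/% 2 * 2 + 1)) %>%
  summarise(CNT = sum(CNT))
# gives 0+1 -> 25 and 2+3 -> 1003 for the data above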
Data
df <- structure(list(FREQ = 0:3, CNT = c(5L, 20L, 1000L, 3L)), class = "data.frame", row.names = c(NA, -4L))
A non-elegant solution using dplyr... there is probably a better way to do this.
dat <- data.frame(FREQ = c(0,1,2,3,4), CNT = c(5,20,1000, 3, 3))
dat2 <- dat %>%
  mutate(index = 0:(nrow(dat) - 1) %/% 2) %>%
  group_by(index)
dat2 %>%
  summarise(new_CNT = sum(CNT)) %>%
  left_join(dat2 %>%
              mutate(CUT = paste0(FREQ[1], "+", FREQ[2])) %>%
              distinct(index, CUT),
            by = "index") %>%
  select(-index)
# A tibble: 3 x 2
new_CNT CUT
<dbl> <chr>
1 25 0+1
2 1003 2+3
3 3 4+NA

Pmax of columns ending with a given string

I would like to conditionally mutate a new column representing the pmax() of columns ending with "_n" for a given row. I know I can do this by explicitly specifying the column names, but I would prefer to have this be the result of a call to ends_with() or similar.
I have tried mutate_at() and plain mutate(). My general thought is that I need to pass a vars(ends_with("_n")) to something, but I'm just missing that something.
Thanks in advance.
library(dplyr)
library(tidyr)
mtcars %>%
  group_by(vs, gear) %>%
  summarize(mean = mean(disp),
            sd = sd(disp),
            n = n()) %>%
  mutate_if(is.double, round, 1) %>%
  mutate(mean_sd = paste0(mean, " (", sd, ")")) %>%
  select(-mean, -sd) %>%
  group_by(vs, gear) %>%
  nest(n, mean_sd, .key = "summary") %>%
  spread(key = vs, value = summary) %>%
  unnest(`0`, `1`, .sep = "_")
gear `0_n` `0_mean_sd` `1_n` `1_mean_sd`
<dbl> <int> <chr> <int> <chr>
1 3 12 357.6 (71.8) 3 201 (72)
2 4 2 160 (0) 10 115.6 (38.5)
3 5 4 229.3 (113.9) 1 95.1 (NA)
edit: both answers are much appreciated. Cheers!
Here's one way using the unquote-splice operator. We can select columns that we want to compare and then splice them as vectors into pmax:
library(tidyverse)
tbl <- structure(list(gear = c(3, 4, 5), `0_n` = c(12L, 2L, 4L), `0_mean_sd` = c("357.6 (71.8)", "160 (0)", "229.3 (113.9)"), `1_n` = c(3L, 10L, 1L), `1_mean_sd` = c("201 (72)", "115.6 (38.5)", "95.1 (NA)")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
tbl %>%
  mutate(pmax = pmax(!!!select(., ends_with("_n"))))
#> # A tibble: 3 x 6
#> gear `0_n` `0_mean_sd` `1_n` `1_mean_sd` pmax
#> <dbl> <int> <chr> <int> <chr> <int>
#> 1 3 12 357.6 (71.8) 3 201 (72) 12
#> 2 4 2 160 (0) 10 115.6 (38.5) 10
#> 3 5 4 229.3 (113.9) 1 95.1 (NA) 4
Created on 2019-04-23 by the reprex package (v0.2.1)
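As a hedged aside, in dplyr >= 1.0.0 a rowwise()/c_across() version avoids the splice entirely (a sketch, not what the answer above uses):
tbl %>%
  rowwise() %>%
  mutate(pmax = max(c_across(ends_with("_n")))) %>%
  ungroup()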
A base R version, just as an alternative:
tbl <- structure(list(gear = c(3, 4, 5), `0_n` = c(12L, 2L, 4L), `0_mean_sd` = c("357.6 (71.8)", "160 (0)", "229.3 (113.9)"), `1_n` = c(3L, 10L, 1L), `1_mean_sd` = c("201 (72)", "115.6 (38.5)", "95.1 (NA)")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
tbl$pmax <- do.call(pmax, as.list(tbl[, grepl("_n$", names(tbl))]))
