Given a table of counts specified in 'dat', I would like to create a data frame with 3 columns (race, grp and outcome) and 206 rows. The variable outcome would be 1 for 'ascertained' and 0 for 'missed'.
dat <- structure(list(race = structure(c(1L, 2L, 1L, 2L), levels = c("black",
"nonblack"), class = "factor"), grp = structure(c(1L, 1L, 2L,
2L), levels = c("hbpm", "uc"), class = "factor"), ascertained = c(63,
32, 24, 21), missed = c(5, 3, 49, 9), total = c(68, 35, 73, 30
)), class = "data.frame", row.names = c(NA, -4L))
1) For each row set race in the output to that race, grp in the output to that group and then generate the appropriate number of 1s and 0s for outcome. The result is 206 x 3.
library(dplyr)
dat %>%
rowwise %>%
summarize(race = race, grp = grp, outcome = rep(1:0, c(ascertained, missed)))
2) In the example data there are no duplicate race/grp combinations; if that is true in general, it can alternatively be written as:
dat %>%
group_by(race, grp) %>%
summarize(outcome = rep(1:0, c(ascertained, missed)), .groups = "drop")
3) A base R solution would be the following. If each combination of race/grp occurs on only one row of the input then 1:nrow(dat) could optionally be replaced with dat[1:2].
do.call("rbind",
by(dat,
1:nrow(dat),
with,
data.frame(race = race, grp = grp, outcome = rep(1:0, c(ascertained, missed)))
)
)
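The same row-expansion idea can also be sketched by repeating row indices instead of using do.call/by. This is only an illustrative variant, not one of the three solutions above. The names n, idx and out are made up for the sketch, and dat is rebuilt here (without the total column) so the block runs on its own.

```r
# Same counts as 'dat' in the question, rebuilt so the sketch is self-contained
dat <- data.frame(
  race = factor(c("black", "nonblack", "black", "nonblack")),
  grp = factor(c("hbpm", "hbpm", "uc", "uc")),
  ascertained = c(63, 32, 24, 21),
  missed = c(5, 3, 49, 9)
)

# Number of output rows contributed by each input row
n <- dat$ascertained + dat$missed

# Repeat each input row index n times, then take race/grp for those rows
idx <- rep(seq_len(nrow(dat)), n)
out <- dat[idx, c("race", "grp")]

# For each input row: 'ascertained' ones followed by 'missed' zeros
out$outcome <- unlist(Map(function(a, m) rep(1:0, c(a, m)),
                          dat$ascertained, dat$missed))
rownames(out) <- NULL

dim(out)  # 206 rows, 3 columns
```

Because the 1s and 0s are generated row by row in the same order as idx, each outcome value lines up with the race/grp it belongs to.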
How about this:
library(tidyverse)
dat <- structure(list(race = structure(c(1L, 2L, 1L, 2L), levels = c("black",
"nonblack"), class = "factor"), grp = structure(c(1L, 1L, 2L,
2L), levels = c("hbpm", "uc"), class = "factor"), ascertained = c(63,
32, 24, 21), missed = c(5, 3, 49, 9), total = c(68, 35, 73, 30
)), class = "data.frame", row.names = c(NA, -4L))
dat2 <- dat %>% select(-total) %>%
pivot_longer(c(ascertained, missed), names_to = "var", values_to="vals") %>%
uncount(vals) %>%
mutate(outcome = case_when(var == "ascertained" ~ 1,
TRUE ~ 0)) %>%
select(-var)
head(dat2)
#> # A tibble: 6 × 3
#> race grp outcome
#> <fct> <fct> <dbl>
#> 1 black hbpm 1
#> 2 black hbpm 1
#> 3 black hbpm 1
#> 4 black hbpm 1
#> 5 black hbpm 1
#> 6 black hbpm 1
dat2 %>%
group_by(race, grp, outcome) %>%
tally()
#> # A tibble: 8 × 4
#> # Groups: race, grp [4]
#> race grp outcome n
#> <fct> <fct> <dbl> <int>
#> 1 black hbpm 0 5
#> 2 black hbpm 1 63
#> 3 black uc 0 49
#> 4 black uc 1 24
#> 5 nonblack hbpm 0 3
#> 6 nonblack hbpm 1 32
#> 7 nonblack uc 0 9
#> 8 nonblack uc 1 21
This is based partially on the linked question from Limey in the comments:
library(tidyverse)
bind_rows(
dat %>% uncount(ascertained) %>% mutate(outcome = 1) %>% select(-missed, -total),
dat %>% uncount(missed) %>% mutate(outcome = 0) %>% select(-ascertained, -total)
)
Here is a relatively simple answer that is based, in part, on the answer suggested in a comment, but adapted to work for your problem, since you need multiple "uncounts". This answer uses functions from the packages tibble, dplyr, and tidyr. These are all in the tidyverse.
The exact method is to create two sub-lists, one listing out the "ascertained", and one listing out the "missed", formatting the ascertained column as you wanted, and then mashing these two together with a basic tibble::add_row.
The relevant code is:
library(tidyverse)
dat2 <- uncount(dat, ascertained, .remove = FALSE) %>%
mutate(ascertained = 1) %>%
select(-missed)
dat3 <- uncount(dat, missed, .remove = TRUE) %>%
mutate(ascertained = 0)
dat4 <- add_row(dat2, dat3) %>% select(-total) %>%
rename(outcome = ascertained)
dat4 should be the data as you asked for it. I would suggest also generating an id column to make things easier to work with, but obviously that is up to you.
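The id column suggested above could be added along these lines. This is a minimal sketch: the small dat4 below is a hypothetical stand-in for the real 206-row result, and the column name id is just an illustration.

```r
# Hypothetical stand-in for dat4: a few expanded rows with an outcome column
dat4 <- data.frame(
  race = c("black", "black", "nonblack"),
  grp = c("hbpm", "hbpm", "uc"),
  outcome = c(1, 0, 1)
)

# Add a sequential row identifier
dat4$id <- seq_len(nrow(dat4))
dat4
```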
I want to combine/reduce a list of dataframes into one dataframe, but I also want to summarize the data in one step. The output is from a simulation; therefore, each dataframe has the same output structure (i.e., a Group column, then 2 columns with values, which will have values that vary for each output).
Minimal Reproducible Example
df_list <- list(structure(list(Group = c("A", "B", "C"), Top_Group = c(1L,
0L, 0L), Efficiency = c(0.464688158128411, 0.652386676520109,
0.282913417555392)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(Group = c("A", "B", "C"
), Top_Group = c(0L, 1L, 0L), Efficiency = c(0.120292583014816,
0.0356206290889531, 0.37196880299598)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame")), structure(list(
Group = c("A", "B", "C"), Top_Group = c(0L, 1L, 0L), Efficiency = c(0.261322160949931,
0.383351784432307, 0.754808459430933)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame")))
What I Have Tried
I know I could bind the data together, then group and summarize.
library(tidyverse)
df_list %>%
bind_rows() %>%
group_by(Group) %>%
summarise(Top_Group = sum(Top_Group), Efficiency = max(Efficiency))
# Group Top_Group Efficiency
# <chr> <int> <dbl>
#1 A 1 0.465
#2 B 2 0.652
#3 C 0 0.755
I was hoping that there was some way to use something like reduce; however, I can only get it to work for pulling out one column (like Top_Group shown here), and am unsure how to use it across all columns (if possible) and return a dataframe instead of vectors.
df_list %>%
map(2) %>%
reduce(`+`)
# [1] 1 2 0
Expected Output
Group Top_Group Efficiency
<chr> <int> <dbl>
1 A 1 0.465
2 B 2 0.652
3 C 0 0.755
In base R you could just do
Reduce(function(a, b) cbind(a[1], a[2] + b[2], pmax(a[3], b[3])), df_list)
#> Group Top_Group Efficiency
#> 1 A 1 0.4646882
#> 2 B 2 0.6523867
#> 3 C 0 0.7548085
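The Reduce approach combines the data frames position by position, so it quietly assumes every data frame lists the groups in the same order. A hedged variant that refers to columns by name and checks that assumption might look like this. The helper name combine_two is made up, and the Efficiency values are rounded copies of the question's data.

```r
# Rounded copies of the three simulation outputs from the question
df_list <- list(
  data.frame(Group = c("A", "B", "C"), Top_Group = c(1L, 0L, 0L),
             Efficiency = c(0.465, 0.652, 0.283)),
  data.frame(Group = c("A", "B", "C"), Top_Group = c(0L, 1L, 0L),
             Efficiency = c(0.120, 0.036, 0.372)),
  data.frame(Group = c("A", "B", "C"), Top_Group = c(0L, 1L, 0L),
             Efficiency = c(0.261, 0.383, 0.755))
)

# Hypothetical helper: combine two outputs, failing loudly if the rows differ
combine_two <- function(a, b) {
  stopifnot(identical(a$Group, b$Group))
  data.frame(
    Group = a$Group,
    Top_Group = a$Top_Group + b$Top_Group,           # running sum
    Efficiency = pmax(a$Efficiency, b$Efficiency)    # running element-wise max
  )
}

res <- Reduce(combine_two, df_list)
res
```

If the group order can differ between simulation runs, binding the rows and grouping by Group is the safer route.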
A base R option using aggregate + ave
aggregate(
. ~ Group,
transform(
do.call(
rbind,
df_list
),
Efficiency = ave(
Efficiency,
Group,
FUN = function(x) max(x) / length(x)
)
), sum
)
or aggregate + sapply
transform(
aggregate(. ~ Group, do.call(rbind, df_list), list),
Top_Group = sapply(Top_Group, sum),
Efficiency = sapply(Efficiency, max)
)
gives
Group Top_Group Efficiency
1 A 1 0.4646882
2 B 2 0.6523867
3 C 0 0.7548085
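The aggregate + ave variant relies on a small trick: aggregate applies sum to every column, so Efficiency is first replaced, within each group, by max(x) / length(x). Summing the length(x) copies of that value then gives back the group maximum. A tiny sketch with made-up numbers:

```r
# Illustrative values for one group
x <- c(0.46, 0.12, 0.26)

# What ave() would put in every row of the group
scaled <- rep(max(x) / length(x), length(x))

# Summing the scaled values recovers the maximum (up to floating-point error)
all.equal(sum(scaled), max(x))
```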
Based on the OP's code, different functions were used on different columns. So, we may have to individually apply those elementwise functions
library(purrr)
reduce(df_list, ~ tibble(.x[1], .x[2] + .y[2], pmax(.x[3], .y[3])))
-output
# A tibble: 3 × 3
Group Top_Group Efficiency
<chr> <int> <dbl>
1 A 1 0.465
2 B 2 0.652
3 C 0 0.755
Yet another solution with reduce, full_join, and then a rowwise summarize:
library(tidyverse)
df_list %>%
reduce(full_join, by = "Group") %>%
rowwise() %>%
summarize(Group = Group,
Top_Group = sum(c_across(starts_with("Top_Group"))),
Efficiency = max(c_across(starts_with("Efficiency")))) %>%
ungroup()
# A tibble: 3 x 3
Group Top_Group Efficiency
<chr> <int> <dbl>
1 A 1 0.465
2 B 2 0.652
3 C 0 0.755
Another option is using data.table, where we can use rbindlist, then summarize the columns.
library(data.table)
rbindlist(df_list)[, list(Top_Group = sum(Top_Group),
Efficiency = max(Efficiency)), by = .(Group)]
Output
Group Top_Group Efficiency
1: A 1 0.4646882
2: B 2 0.6523867
3: C 0 0.7548085
Benchmark
Just out of curiosity (as this question is not about efficiency), I also ran all the current answers to see what is the fastest. The base R options are fast, but apparently the data.table option is the fastest.
Code
microbenchmark::microbenchmark(akrun = reduce(df_list, ~ tibble(.x[1], .x[2] + .y[2], pmax(.x[3], .y[3]))),
AllanCameron = Reduce(function(a, b) cbind(a[1], a[2] + b[2], pmax(a[3], b[3])), df_list),
ThomasIsCoding_agg_ave = {aggregate(
. ~ Group,
transform(
do.call(
rbind,
df_list
),
Efficiency = ave(
Efficiency,
Group,
FUN = function(x) max(x) / length(x)
)
), sum
)},
ThomasIsCoding_agg_sapply = {transform(
aggregate(. ~ Group, do.call(rbind, df_list), list),
Top_Group = sapply(Top_Group, sum),
Efficiency = sapply(Efficiency, max)
)
},
deschen = df_list %>%
reduce(full_join, by = "Group") %>%
rowwise() %>%
summarize(Group = Group,
Top_Group = sum(c_across(starts_with("Top_Group"))),
Efficiency = max(c_across(starts_with("Efficiency")))) %>%
ungroup(),
TomHoel = df_list %>%
tibble() %>%
unnest(cols = c(.)) %>%
group_by(Group) %>%
summarise(Top_Group = sum(Top_Group), Efficiency = max(Efficiency)),
AndrewGB_tidyverse = df_list %>%
bind_rows() %>%
group_by(Group) %>%
summarise(Top_Group = sum(Top_Group), Efficiency = max(Efficiency)),
AndrewGB_datatable = rbindlist(df_list)[, list(Top_Group = sum(Top_Group), Efficiency = max(Efficiency)), by=.(Group)],
times = 2000
)
You almost had it! Check out ?unnest()
require(tidyverse)
df_list %>%
tibble() %>%
unnest(cols = c(.)) %>%
group_by(Group) %>%
summarise(Top_Group = sum(Top_Group), Efficiency = max(Efficiency))
# A tibble: 3 x 3
Group Top_Group Efficiency
<chr> <int> <dbl>
1 A 1 0.465
2 B 2 0.652
3 C 0 0.755
Another base R, a few months late:
subset(
within(
do.call(rbind, df_list),
{
Top_Group <- ave(Top_Group, Group, FUN = sum)
Efficiency <- ave(Efficiency, Group, FUN = max)
}
),
!(duplicated(Group))
)
I have the following data:
ID cancer cancer_date stroke stroke_date diabetes diabetes_date
1 1 Feb2017 0 Jan2015 1 Jun2015
2 0 Feb2014 1 Jan2015 1 Jun2015
I would like to get
ID condition date
1 cancer xx
1 diabetes xx
2 stroke xx
2 diabetes xx
I tried reshape and gather, but they did not do what I want. Any ideas how I can do this?
This should do it. The key to make it work easily is to change the names of cancer, stroke and diabetes to x_val and then you can use pivot_longer() from tidyr to do the work.
library(tidyr)
library(dplyr)
dat <- tibble::tribble(
~ID, ~cancer, ~cancer_date, ~stroke, ~stroke_date, ~diabetes, ~diabetes_date,
1, 1, "Feb2017", 0, "Jan2015", 1, "Jun2015",
2, 0, "Feb2014", 1, "Jan2015", 1, "Jun2015")
dat %>%
rename("cancer_val" = "cancer",
"stroke_val" = "stroke",
"diabetes_val" = "diabetes") %>%
pivot_longer(cols=-ID,
names_to = c("diagnosis", ".value"),
names_pattern="(.*)_(.*)") %>%
filter(val == 1)
# # A tibble: 4 x 4
# ID diagnosis val date
# <dbl> <chr> <dbl> <chr>
# 1 1 cancer 1 Feb2017
# 2 1 diabetes 1 Jun2015
# 3 2 stroke 1 Jan2015
# 4 2 diabetes 1 Jun2015
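To see why the rename matters, it may help to look at how the pattern "(.*)_(.*)" splits the column names: the first capture group becomes diagnosis and the second becomes the output column name (via ".value"). The sketch below only mimics that split with base R's sub(); it is not how tidyr implements it, and the variable names are made up.

```r
# Column names after the rename step
nms <- c("cancer_val", "cancer_date", "stroke_val",
         "stroke_date", "diabetes_val", "diabetes_date")

pattern <- "(.*)_(.*)"

# First capture group: the diagnosis; second: the target column (val or date)
diagnosis <- sub(pattern, "\\1", nms)
value_col <- sub(pattern, "\\2", nms)

data.frame(name = nms, diagnosis, value_col)
```

A bare name such as cancer has no underscore, so it would not match the pattern, which is why the flag columns are renamed to x_val first.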
library(data.table)
data <- data.table(ID = c(1, 2), cancer = c(1, 0), cancer_date = c("Feb2017", "Feb2014"), stroke = c(0, 1), stroke_date = c("Jan2015", "Jan2015"), diabetes = c(1, 1), diabetes_date = c("Jun2015", "Jun2015"))
datawide <-
melt(data, id.vars = c("ID", "cancer", "stroke", "diabetes"),
measure.vars = c("cancer_date", "stroke_date", "diabetes_date"))
datawide[(cancer == 1 & variable == "cancer_date") |
(stroke == 1 & variable == "stroke_date") |
(diabetes == 1 & variable == "diabetes_date"), .(ID, condition = variable, date = value)]
Try this solution using pivot_longer() and a flag variable to filter the desired states. After pivoting, you can drop the zero values and keep only the rows where the condition is 1. Here is the code:
library(tidyverse)
#Code
df2 <- df %>% pivot_longer(cols = -c(ID,contains('_'))) %>%
filter(value!=0) %>% rename(condition=name) %>% select(-value) %>%
pivot_longer(-c(ID,condition)) %>%
separate(name,c('v1','v2'),sep='_') %>%
mutate(Flag=ifelse(condition==v1,1,0)) %>%
filter(Flag==1) %>% select(-c(v1,v2,Flag)) %>%
rename(date=value)
Output:
# A tibble: 4 x 3
ID condition date
<int> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
Some data used:
#Data
df <- structure(list(ID = 1:2, cancer = 1:0, cancer_date = c("Feb2017",
"Feb2014"), stroke = 0:1, stroke_date = c("Jan2015", "Jan2015"
), diabetes = c(1L, 1L), diabetes_date = c("Jun2015", "Jun2015"
)), class = "data.frame", row.names = c(NA, -2L))
If the first option is too complex, here is another choice:
#Code 2
df2 <- df %>% mutate(across(everything(),~as.character(.))) %>%
pivot_longer(cols = -c(ID)) %>%
separate(name,c('condition','v2'),sep = '_') %>%
replace(is.na(.),'val') %>%
pivot_wider(names_from = v2,values_from=value) %>%
filter(val==1) %>% select(-val)
Output:
# A tibble: 4 x 3
ID condition date
<chr> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
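The flag idea used in this answer (keep a date only when the matching condition is 1) can also be sketched in base R. This is an illustrative alternative under the assumption that every condition x has a companion column named x_date; the names conds, long and res are made up.

```r
df <- data.frame(
  ID = 1:2,
  cancer = c(1, 0), cancer_date = c("Feb2017", "Feb2014"),
  stroke = c(0, 1), stroke_date = c("Jan2015", "Jan2015"),
  diabetes = c(1, 1), diabetes_date = c("Jun2015", "Jun2015"),
  stringsAsFactors = FALSE
)

conds <- c("cancer", "stroke", "diabetes")

# One block of rows per condition: ID, condition name, its flag and its date
long <- do.call(rbind, lapply(conds, function(cn) {
  data.frame(ID = df$ID, condition = cn,
             flag = df[[cn]], date = df[[paste0(cn, "_date")]],
             stringsAsFactors = FALSE)
}))

# Keep only the conditions that are present, then sort by ID
res <- long[long$flag == 1, c("ID", "condition", "date")]
res <- res[order(res$ID), ]
rownames(res) <- NULL
res
```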
library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#> # A tibble: 4 x 5
#> Date col1 thisCol thatCol col999
#> <date> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 1 NA 25 99
#> 2 2020-01-01 2 8 26 99
#> 3 2020-01-01 3 NA 27 99
#> 4 NA 4 3 28 99
My actual R data frame has hundreds of columns that aren't neatly named, but can be approximated by the df data frame above.
I want to replace all values of NA with 0, with the exception of several columns (in my example I want to leave out the Date column and the thatCol column). I'd want to do it in this sort of fashion:
df %>% replace(is.na(.), 0)
#> Error: Assigned data `values` must be compatible with existing data.
#> i Error occurred for column `Date`.
#> x Can't convert <double> to <date>.
#> Run `rlang::last_error()` to see where the error occurred.
And my unsuccessful ideas for accomplishing the "everything except" replace NA are shown below.
df %>% replace(is.na(c(., -c(Date, thatCol)), 0))
df %>% replace_na(list([, c(2:3, 5)] = 0))
df %>% replace_na(list(everything(-c(Date, thatCol)) = 0))
Is there a way to select everything BUT in the way I need to? There are hundreds of columns, named inconsistently, so typing them one by one is not a practical option.
You can use mutate_at :
library(dplyr)
Remove them by Name
df %>% mutate_at(vars(-c(Date, thatCol)), ~replace(., is.na(.), 0))
Remove them by position
df %>% mutate_at(-c(1,4), ~replace(., is.na(.), 0))
Select them by name
df %>% mutate_at(vars(col1, thisCol, col999), ~replace(., is.na(.), 0))
Select them by position
df %>% mutate_at(c(2, 3, 5), ~replace(., is.na(.), 0))
If you want to use replace_na
df %>% mutate_at(vars(-c(Date, thatCol)), tidyr::replace_na, 0)
Note that mutate_at is superseded by across as of dplyr 1.0.0.
You have several options here based on data.table.
One of the coolest options: setnafill (version >= 1.12.4):
library(data.table)
setDT(df)
# fill NAs with 0 in every column except Date and thatCol
data.table::setnafill(df, fill = 0, cols = setdiff(colnames(df), c("Date", "thatCol")))
Note that your dataframe is updated by reference.
Another base solution:
to_change<-grep("^(this|col)",names(df))
df[to_change]<- sapply(df[to_change],function(x) replace(x,is.na(x),0))
df
# A tibble: 4 x 5
Date col1 thisCol thatCol col999
<date> <dbl> <dbl> <int> <dbl>
1 2020-01-01 1 0 25 99
2 2020-01-01 2 8 26 99
3 2020-01-01 3 0 27 99
4 NA 0 3 28 99
Data (I changed one value):
df <- structure(list(Date = structure(c(18262, 18262, 18262, NA), class = "Date"),
col1 = c(1L, 2L, 3L, NA), thisCol = c(NA, 8, NA, 3), thatCol = 25:28,
col999 = c(99, 99, 99, 99)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
replace works on a data.frame, so we can just do the replacement by index and update the original dataset
df[-c(1, 4)] <- replace(df[-c(1, 4)], is.na(df[-c(1, 4)]), 0)
Or using replace_na with across (from the new dplyr)
library(dplyr)
library(tidyr)
df %>%
mutate(across(-c(Date, thatCol), ~ replace_na(., 0)))
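With hundreds of columns, excluding by name may be less fragile than the positional index -c(1, 4). A minimal base R sketch of that variant follows. The names keep_as_is and cols are made up, and df is the example data from the question rebuilt as a plain data.frame.

```r
df <- data.frame(
  Date = c(rep(as.Date("2020-01-01"), 3), NA),
  col1 = 1:4,
  thisCol = c(NA, 8, NA, 3),
  thatCol = 25:28,
  col999 = rep(99, 4)
)

# Columns to leave untouched, and everything else
keep_as_is <- c("Date", "thatCol")
cols <- setdiff(names(df), keep_as_is)

# Replace NA with 0 column by column, only in the selected columns
df[cols] <- lapply(df[cols], function(x) replace(x, is.na(x), 0))
df
```

The NA in Date is left alone, which avoids the "Can't convert <double> to <date>" error from the question.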
If you know the ones that you don't want to change, you could do it like this:
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#dplyr
df_nonreplace <- select(df, c("Date", "thatCol"))
df_replace <- df[ ,!names(df) %in% names(df_nonreplace)]
df_replace[is.na(df_replace)] <- 0
df <- cbind(df_nonreplace, df_replace)
> head(df)
Date thatCol col1 thisCol col999
1 2020-01-01 25 1 0 99
2 2020-01-01 26 2 8 99
3 2020-01-01 27 3 0 99
4 <NA> 28 4 3 99
Here's my starting dataset:
> data
ID Record Value
A 1 100
A 3 200
A 4 300
B 1 800
For each ID, I want a record for each number 1 through 4. If a record is not available, create it using the most recent record.
The final dataset should look like this:
> newdata
ID Updt_Record Value
A 1 100
A 2 100
A 3 200
A 4 300
B 1 800
B 2 800
B 3 800
B 4 800
To do this, I am currently using dplyr:
library(dplyr)
data1 <- data %>% group_by(ID) %>% filter(Record <= 1) %>% filter(Record == max(Record)) %>% mutate(Updt_Record = 1)
data2 <- data %>% group_by(ID) %>% filter(Record <= 2) %>% filter(Record == max(Record)) %>% mutate(Updt_Record = 2)
data3 <- data %>% group_by(ID) %>% filter(Record <= 3) %>% filter(Record == max(Record)) %>% mutate(Updt_Record = 3)
data4 <- data %>% group_by(ID) %>% filter(Record <= 4) %>% filter(Record == max(Record)) %>% mutate(Updt_Record = 4)
newdata <- data1 %>%
bind_rows(data2) %>% bind_rows(data3) %>% bind_rows(data4) %>%
arrange(ID, Record) %>%
select(ID, Updt_Record, Value)
Is there a more efficient way of doing this? Thanks!
library(tidyr)
library(dplyr)
data %>%
mutate(Record=factor(Record, 1:4)) %>%
complete(ID, Record) %>%
fill(Value) %>%
mutate(Record=as.numeric(as.character(Record)))
# # A tibble: 8 x 3
# ID Record Value
# <fctr> <dbl> <int>
# 1 A 1 100
# 2 A 2 100
# 3 A 3 200
# 4 A 4 300
# 5 B 1 800
# 6 B 2 800
# 7 B 3 800
# 8 B 4 800
Data
data <- structure(list(ID = structure(c(1L, 1L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), Record = c(1L, 3L, 4L, 1L), Value = c(100L,
200L, 300L, 800L)), .Names = c("ID", "Record", "Value"), class = "data.frame", row.names = c(NA,
-4L))
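The complete() + fill() answer can be approximated in base R, which makes the two steps explicit: build every ID/Record combination, then carry the most recent Value forward within each ID. This is a sketch under the assumption that Record runs from 1 to 4; the helper locf and the names grid and full are made up.

```r
data <- data.frame(
  ID = c("A", "A", "A", "B"),
  Record = c(1L, 3L, 4L, 1L),
  Value = c(100L, 200L, 300L, 800L),
  stringsAsFactors = FALSE
)

# Step 1: every ID paired with every record number 1..4
grid <- expand.grid(Record = 1:4, ID = unique(data$ID),
                    stringsAsFactors = FALSE)
full <- merge(grid, data, by = c("ID", "Record"), all.x = TRUE)
full <- full[order(full$ID, full$Record), ]

# Step 2: last observation carried forward (leading NAs stay NA)
locf <- function(v) {
  idx <- cumsum(!is.na(v))
  c(NA, v[!is.na(v)])[idx + 1]
}
full$Value <- ave(full$Value, full$ID, FUN = locf)
rownames(full) <- NULL
full
```

Filling within each ID (via ave here) keeps one ID's last value from leaking into the next ID when an ID has no record 1.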