I have a data frame with a mixture of patient and appointment information. Each patient may have attended multiple appointments. Some patient information is collected at each appointment, with the result that some patient information is conflicting and some is missing.
I want to fill in the missing patient information for appointments where it wasn't recorded, based upon data recorded at other appointments. And (this is where I'm stuck) I want to take the 'minimum' level (earliest in the order of levels) recorded for factors that have conflicting information for the same patient. (In the example below the factor levels are in alphabetical order, but this isn't always the case.)
This is similar to this question, but I am using factors instead of characters, and I have multiple factors that I want the minimum for so can't filter by row.
e.g. I have
df.have <- data.frame(
grp_id = rep(1:3, each = 2),
grpvar1 = factor(c("B", "A", "B", "C", NA, "A")),
grpvar2 = factor(c("a", "b", "c", NA, NA, "x")),
appt_id = 1:6)
I want
grp_id grpvar1 grpvar2 appt_id
1 A a 1
1 A a 2
2 B c 3
2 B c 4
3 A x 5
3 A x 6
or at least
grp_id grpvar1 grpvar2
1 A a
2 B c
3 A x
We can try with summarise_each. As we need the first level of the 'grpvar' variables, we need to make sure that the unused levels are dropped beforehand (using droplevels).
df.have %>%
group_by(grp_id) %>%
summarise_each(funs(first(levels(droplevels(.)))), grpvar1:grpvar2)
# grp_id grpvar1 grpvar2
# <int> <chr> <chr>
#1 1 A a
#2 2 B c
#3 3 A x
Or if we use mutate_each, we get the first desired output (all rows kept):
df.have %>%
group_by(grp_id) %>%
mutate_each(funs(levels(droplevels(.))[1]), grpvar1:grpvar2)
# grp_id grpvar1 grpvar2 appt_id
# <int> <chr> <chr> <int>
#1 1 A a 1
#2 1 A a 2
#3 2 B c 3
#4 2 B c 4
#5 3 A x 5
#6 3 A x 6
If we need the 'grpvar' columns to stay as factors in the output:
df.have %>%
group_by(grp_id) %>%
mutate_each(funs(factor(levels(droplevels(.))[1])), grpvar1:grpvar2)
# grp_id grpvar1 grpvar2 appt_id
# <int> <fctr> <fctr> <int>
#1 1 A a 1
#2 1 A a 2
#3 2 B c 3
#4 2 B c 4
#5 3 A x 5
#6 3 A x 6
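Note that summarise_each(), mutate_each() and funs() have since been deprecated in dplyr. Assuming dplyr >= 1.0 is available, a sketch of the same grouped summary with across():
df.have %>%
  group_by(grp_id) %>%
  summarise(across(grpvar1:grpvar2, ~ levels(droplevels(.x))[1]))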
Or using data.table
library(data.table)
setDT(df.have)[, lapply(.SD, function(x) levels(droplevels(x))[1]),
               grp_id, .SDcols = grpvar1:grpvar2]
While revisiting this for a different project with a larger dataset, I realised that it was much more efficient (though it required more keystrokes) to convert the factors to numbers, take the minimum, then convert back to factors.
library(data.table)
library(dplyr)
set.seed(1)
n <- 100L
dat <- data.table(
grp_id = rep(1:(n/10), each = 10),
grpvar1 = factor(sample(c(LETTERS, NA), n, replace = TRUE), levels = LETTERS),
grpvar2 = factor(sample(c(letters, NA), n, replace = TRUE), levels = letters),
appt_id = 1:n)
cols <- c("grpvar1","grpvar2")
dplyr_fct <- function(data, cols) {
data %>%
group_by(grp_id) %>%
mutate_each(funs(factor(levels(droplevels(.))[1])), one_of(cols))
}
dt_fct <- function(data, cols) {
data[, lapply(.SD, function(x) levels(droplevels(x))[1]), grp_id, .SDcols = cols]}
dt_nmbr <- function(data, cols) {
dat_out <- copy(data)
v_lvl = lapply(dat_out[, .SD, .SDcols = cols], levels)
# Convert factors to numeric
for(col in cols) set(dat_out, j = col, value = as.numeric(dat_out[[col]]))
# Take the minimum value (earliest factor level) per group
dat_out[, (cols):= lapply(.SD, min, na.rm = TRUE), by = grp_id, .SDcols = cols]
# Convert back to factor
for(col in cols) set(dat_out, j = col,
value = factor(dat_out[[col]], levels = 1:length(v_lvl[[col]]), labels = v_lvl[[col]]))
assign("dat_out", dat_out, envir = .GlobalEnv)
}
mbm <- microbenchmark::microbenchmark(
mbm_dplyr_fct = dplyr_fct(dat, cols),
mbm_dt_fct = dt_fct(dat, cols),
mbm_dt_nmbr = dt_nmbr(dat, cols)
)
mbm
Unit: milliseconds
expr min lq mean median uq max neval cld
mbm_dplyr_fct 84.487484 85.829834 90.988740 87.015878 91.159178 120.22171 100 c
mbm_dt_fct 56.768529 58.007094 60.988083 58.831850 60.269427 87.11799 100 b
mbm_dt_nmbr 4.181538 4.406392 4.540248 4.557948 4.619757 6.04197 100 a
I have the data frame below. I need to find the row min and max, excluding the few columns that are characters.
df
x y z
1 1 1 a
2 2 5 b
3 7 4 c
I need
df
x y z Min Max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
Another dplyr possibility could be:
df %>%
mutate(Max = do.call(pmax, select_if(., is.numeric)),
Min = do.call(pmin, select_if(., is.numeric)))
x y z Max Min
1 1 1 a 1 1
2 2 5 b 5 2
3 7 4 c 7 4
Or a variation proposed by @G. Grothendieck:
df %>%
mutate(Min = pmin(!!!select_if(., is.numeric)),
Max = pmax(!!!select_if(., is.numeric)))
Another base R solution. Subset only the numeric columns, then use apply on each row to get the minimum and maximum values with range.
cbind(df, t(apply(df[sapply(df, is.numeric)], 1, function(x)
setNames(range(x, na.rm = TRUE), c("min", "max")))))
# x y z min max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
1) This one-liner uses no packages:
transform(df, min = pmin(x, y), max = pmax(x, y))
giving:
x y z min max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
2) If you have many columns and don't want to list them all or determine yourself which are numeric then this also uses no packages.
ix <- sapply(df, is.numeric)
transform(df, min = apply(df[ix], 1, min), max = apply(df[ix], 1, max))
If your actual data has NAs and if you want to ignore them when taking the min or max then min, max, pmin and pmax all take an optional na.rm = TRUE argument.
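For instance, a minimal sketch of the same one-liner with NAs ignored:
transform(df, min = pmin(x, y, na.rm = TRUE), max = pmax(x, y, na.rm = TRUE))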
Note
Lines <- "x y z
1 1 1 a
2 2 5 b
3 7 4 c"
df <- read.table(text = Lines)
1) We can use select_if to select the columns that are numeric, then get the rowwise min and max with pmin and pmax, and bind the result with the original dataset.
library(dplyr)
library(purrr)
df %>%
select_if(is.numeric) %>%
transmute(Min = reduce(., pmin, na.rm = TRUE),
Max = reduce(., pmax, na.rm = TRUE)) %>%
bind_cols(df, .)
# x y z Min Max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
NOTE: Here, we use only a single select_if call.
2) The same can be done in base R (no packages used)
i1 <- names(which(sapply(df, is.numeric)))
df['Min'] <- do.call(pmin, c(df[i1], na.rm = TRUE))
df['Max'] <- do.call(pmax, c(df[i1], na.rm = TRUE))
Also, as stated in the comments, this is the generalized option. If there are only two columns, just doing pmin(x, y) or pmax(x, y) is possible, but that neither checks whether the columns are numeric nor generalizes to more columns.
NOTE: All of the solutions mentioned here were either posted first in this answer or come from the comments with the OP.
data
df <- structure(list(x = c(1L, 2L, 7L), y = c(1L, 5L, 4L), z = c("a",
"b", "c")), class = "data.frame", row.names = c("1", "2", "3"
))
I have a data frame with a grouping variable S and, for each value of S, several values of another variable B.
Like this:
[image: example data frame with columns S and B]
So, I need the subset of rows whose S group contains all the possible values of B. In this example, the subset is formed by S = a and S = b:
[image: desired subset]
Any idea? Thanks!!
An option would be to group by 'S' and filter the groups having all the unique values of the column 'B':
library(dplyr)
un1 <- unique(df1$B)
df1 %>%
group_by(S) %>%
filter(all(un1 %in% B))
# A tibble: 8 x 2
# Groups: S [2]
# S B
# <fct> <dbl>
#1 a 1
#2 a 2
#3 a 3
#4 a 4
#5 d 1
#6 d 2
#7 d 3
#8 d 4
Or with data.table
library(data.table)
setDT(df1)[, .SD[all(un1 %in% B)], S]
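An equivalent data.table variant returns .SD only for the groups that pass the test, which reads a little more idiomatically:
setDT(df1)[, if (all(un1 %in% B)) .SD, S]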
Or using base R
df1[with(df1, ave(B, S, FUN = function(x) all(un1 %in% x)) == 1),]
data
df1 <- data.frame(S = rep(letters[1:4], c(4, 3, 2, 4)),
B = c(1:4, c(1, 3, 4), 1:2, 1:4))
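As a quick illustration, cross-tabulating S against B shows which groups cover every value of B (here a and d):
table(df1$S, df1$B)
#    1 2 3 4
#  a 1 1 1 1
#  b 1 0 1 1
#  c 1 1 0 0
#  d 1 1 1 1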
I would like to ask if there is a way of removing a group from a data frame using dplyr (or any other way, for that matter). Let's say I have a data frame in the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. In the table above, that would remove group 2, because it contains the values a, a, b, but not group 3, where the values are a, c, a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing by comparing to the next value, using lead; the result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
prepare data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
write functions to test whether consecutive repetitions are present:
any.consecutive.p <- function(v) {
  if (length(v) < 2) return(FALSE)  # nothing to compare in a length-one vector
  for (i in 1:(length(v) - 1)) {
    if (v[i] == v[i + 1]) {
      return(TRUE)
    }
  }
  return(FALSE)
}
any.consecutive.in.col.p <- function(df, col) {
any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds the first consecutive repetition in a vector v.
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
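For reference, a vectorized sketch of the same predicate (assuming the column contains no NAs), comparing each element with its successor:
any.consecutive.p <- function(v) {
  # TRUE if any element equals the one immediately after it
  length(v) > 1 && any(head(v, -1) == tail(v, -1))
}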
split data frame by values of Variable.1
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally, go over this list of data frames and test each one for consecutive duplicates in the Variable.2 column.
If found, don't collect it.
Bind the collected data frames by rows.
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
Say you want to remove all groups of df, grouped by a, where the column b has repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
A benchmark shows that data.table is faster:
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
library(microbenchmark)
microbenchmark(use_dplyr(), use_DT())
I am having some difficulty creating a function that groups by column P and counts all the "Yes" values in the columns saved in a list such as list_col. I am pretty sure I would be using a function from the apply family, but I am not sure how to group by a certain column (column P in this case).
P <- as.character(c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
a <- as.character(c(NA,"Yes","Yes",NA,NA,NA,"Yes","Yes","Yes",NA))
b <- as.character(c(NA,"Yes",NA,NA,"Yes",NA,"Yes","Yes",NA,NA))
c <- as.character(c(NA,NA,NA,NA,"Yes",NA,"Yes",NA,NA,NA))
df_sample <- as.data.frame(cbind(P, a, b, c))
df_sample
list_col <- c("a","b","c")
Ideally, with the above sample data frame, I would be looking for the answer below, with the columns changing according to list_col:
P a b c
A 2 2 1
B 3 2 1
Any help would be appreciated
Here is an approach via dplyr,
library(dplyr)
df_sample %>%
group_by(P) %>%
select(list_col) %>%
summarise_all(funs(sum(. == 'Yes', na.rm = TRUE)))
#Adding missing grouping variables: `P`
# A tibble: 2 x 4
# P a b c
# <fctr> <int> <int> <int>
#1 A 2 2 1
#2 B 3 2 1
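funs() has since been deprecated; assuming dplyr >= 1.0, the same counts can be written with across():
df_sample %>%
  group_by(P) %>%
  summarise(across(all_of(list_col), ~ sum(.x == "Yes", na.rm = TRUE)))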
In data.table:
library(data.table)
list_col <- c("a","b","c")
setDT(df_sample)[, (lapply(.SD, function(x) sum(x=="Yes", na.rm = TRUE))), by = P, .SDcols = list_col]
# P a b c
#1: A 2 2 1
#2: B 3 2 1
Alternatively, a base R solution still using lapply:
res <-lapply(split(df_sample[,list_col], df_sample$P), function(x) colSums(x=="Yes", na.rm = TRUE))
do.call(rbind, res)
# a b c
#A 2 2 1
#B 3 2 1
For what it's worth, a microbenchmark on my machine:
sampleDT <- as.data.table(df_sample)  # assumed: a data.table copy for the benchmark
microbenchmark::microbenchmark(
  splitlapply = do.call(rbind, lapply(split(df_sample[,list_col], df_sample$P), function(x) colSums(x=="Yes", na.rm = TRUE))),
  dt = sampleDT[, (lapply(.SD, function(x) sum(x=="Yes", na.rm = TRUE))), by = P, .SDcols = list_col])
Unit: microseconds
expr min lq mean median uq max neval
splitlapply 455.841 505.0715 546.6699 529.3225 561.2315 889.436 100
dt 861.722 1052.9920 1114.2752 1111.7040 1166.7695 1707.761 100
Using melt from reshape
library(reshape)
df <- na.omit(melt(df_sample, id.vars = 'P'))
table(df$P, df$variable)
a b c
A 2 2 1
B 3 2 1
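The reshape package is largely superseded; assuming a recent tidyr (>= 1.1) is available, a sketch of the same melt-and-tabulate idea:
library(dplyr)
library(tidyr)
df_sample %>%
  pivot_longer(all_of(list_col)) %>% # melt a, b, c into name/value pairs
  drop_na(value) %>%                 # the only non-NA value is "Yes"
  count(P, name) %>%                 # count "Yes" per group and column
  pivot_wider(names_from = name, values_from = n, values_fill = 0)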
df_sample <- as.data.frame(cbind(P, a, b, c), stringsAsFactors = FALSE)
# Replace each "Yes" with the row's group label from P
for (i in list_col){
  df_r <- df_sample[, i] == 'Yes' & !is.na(df_sample[, i])
  df_sample[df_r, i] <- df_sample$P[df_r]
}
# Tabulating each column then counts the former "Yes" values per group
sapply(df_sample[, list_col], table)
I need to do something similar to below on a very large data set (with many groups), and read somewhere that using .SD is slow. Is there any faster way to perform the following operation?
To be more precise, I need to create a new column that contains the min value for each group after having excluded a subset of observations in that group (something similar to minif in Excel).
library(data.table)
dt <- data.table(valid = c(0,1,1,0,1),
a = c(1,1,2,3,4),
groups = c("A", "A", "A", "B", "B"))
dt[, valid_min := .SD[valid == 1, min(a, na.rm = TRUE)], by = groups]
With the output:
> dt
   valid a groups valid_min
1: 0 1 A 1
2: 1 1 A 1
3: 1 2 A 1
4: 0 3 B 4
5: 1 4 B 4
To make it even more complicated, groups could have no valid entries or they could have multiple valid but missing entries. My current code is similar to this:
dt <- data.table(valid = c(0,1,1,0,1,0,1,1),
a = c(1,1,2,3,4,3,NA,NA),
k = c("A", "A", "A", "B", "B", "C", "D", "D"))
dt[, valid_min := .SD[valid == 1,
ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k]
Output:
> dt
valid a k valid_min
1: 0 1 A 1
2: 1 1 A 1
3: 1 2 A 1
4: 0 3 B 4
5: 1 4 B 4
6: 0 3 C NA
7: 1 NA D NA
8: 1 NA D NA
There's...
dt[dt[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
This should be fast since the inner call to min is optimized by group. (See ?GForce.) Groups with no valid non-NA rows simply don't appear in the inner table, so their the_min stays NA after the update join.
We can do the same using dplyr
dt %>%
group_by(groups) %>%
mutate(valid_min = min(ifelse(valid == 1,
a, NA),
na.rm = TRUE))
Which gives:
valid a groups valid_min
<dbl> <dbl> <chr> <dbl>
1 0 1 A 1
2 1 1 A 1
3 1 2 A 1
4 0 3 B 4
5 1 4 B 4
Alternatively, if you are not interested in keeping the 'non-valid' rows, we can do the following:
dt %>%
filter(valid == 1) %>%
group_by(groups) %>%
mutate(valid_min = min(a))
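One caveat with the dplyr approach (a sketch against the extended data above, where the grouping column is k): groups with no usable rows make min(..., na.rm = TRUE) return Inf with a warning, which can be mapped back to NA afterwards:
dt %>%
  group_by(k) %>%
  mutate(valid_min = min(ifelse(valid == 1, a, NA), na.rm = TRUE),
         valid_min = ifelse(is.finite(valid_min), valid_min, NA)) %>%
  ungroup()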
Looks like I provided the slowest approach. Comparing each approach (using a larger, replicated data frame called df) with a microbenchmark test:
library(microbenchmark)
library(ggplot2)
mbm <- microbenchmark(
dplyr.test = suppressWarnings(df %>%
group_by(k) %>%
mutate(valid_min = min(ifelse(valid == 1,
a, NA),
na.rm = TRUE),
valid_min = ifelse(valid_min == Inf,
NA,
valid_min))),
data.table.test = df[, valid_min := .SD[valid == 1,
ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k],
GForce.test = df[df[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
)
autoplot(mbm)
...well, I tried...