I have a data frame like the one below:
sample <- data.frame(ID = 1:9,
                     Group = c('AA','AA','AA','BB','BB','CC','CC','BB','CC'),
                     Value = c(1,1,1,2,2,2,3,2,3))
ID Group Value
1 AA 1
2 AA 1
3 AA 1
4 BB 2
5 BB 2
6 CC 2
7 CC 3
8 BB 2
9 CC 3
I want to select groups according to the number of distinct (unique) values within each group. For example, select groups where all values within the group are the same (one distinct value per group). If you look at the group CC, it has more than one distinct value (2 and 3) and should thus be removed. The other groups, with only one distinct value, should be kept. Desired output:
ID Group Value
1 AA 1
2 AA 1
3 AA 1
4 BB 2
5 BB 2
8 BB 2
Could you suggest simple, fast R code that solves this problem?
Here's a solution using dplyr:
library(dplyr)
sample <- data.frame(
  ID = 1:9,
  Group = c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
  Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
)
sample %>%
  group_by(Group) %>%
  filter(n_distinct(Value) == 1)
We group the data by Group, and then only select groups where the number of distinct values of Value is 1.
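Note that the result is still grouped by Group; if you want a plain, ungrouped result back, you can add ungroup() at the end (a minor stylistic point):
sample %>%
  group_by(Group) %>%
  filter(n_distinct(Value) == 1) %>%
  ungroup()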
data.table version:
library(data.table)
sample <- as.data.table(sample)
sample[, if (uniqueN(Value) == 1) .SD, by = Group]
# Group ID Value
#1: AA 1 1
#2: AA 2 1
#3: AA 3 1
#4: BB 4 2
#5: BB 5 2
#6: BB 8 2
An alternative using ave, if the data is numeric, is to check whether the variance is 0 (note that var() returns NA for single-row groups, so this assumes at least two rows per group):
sample[with(sample, ave(Value, Group, FUN = var)) == 0, ]
An alternative solution that could be faster on large data is:
setkey(sample, Group, Value)
# Since data.table 1.9.8, unique() considers all columns by default,
# so restrict it to the key columns explicitly:
ans <- sample[unique(sample, by = key(sample))[, .N, by = Group][N == 1L, Group]]
The point is that computing the unique values for each group can be time-consuming when there are many groups. Instead, we can set the key on the data.table, take the unique values by key (which is extremely fast), count the rows per group, and keep only the groups where that count is 1. We can then perform a join (which is once again very fast).
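To make that concrete, here's what the intermediate steps look like on the small example above (a sketch; as noted, recent data.table versions need unique() restricted to the key columns explicitly):
setkey(sample, Group, Value)
u <- unique(sample, by = key(sample))  # one row per (Group, Value) pair
u[, .N, by = Group]                    # number of distinct Values per Group
#    Group N
# 1:    AA 1
# 2:    BB 1
# 3:    CC 2
u[, .N, by = Group][N == 1L, Group]    # groups to keep: "AA" "BB"
Here's a benchmark on large data: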
require(data.table)
set.seed(1L)
sample <- data.table(ID = 1:1e7,
                     Group = sample(rep(paste0("id", 1:1e5), each = 100)),
                     Value = sample(2, 1e7, replace = TRUE, prob = c(0.9, 0.1)))
system.time(
  ans1 <- sample[, if (length(unique(Value)) == 1) .SD, by = Group]
)
# minimum of three runs
# user system elapsed
# 14.328 0.066 14.382
system.time({
  setkey(sample, Group, Value)
  ans2 <- sample[unique(sample, by = key(sample))[, .N, by = Group][N == 1L, Group]]
})
# minimum of three runs
# user system elapsed
# 5.661 0.219 5.877
setkey(ans1, Group, ID)
setkey(ans2, Group, ID)
identical(ans1, ans2) # [1] TRUE
You can build a logical selector for sample using ave in many different ways.
sample[ave(sample$Value, sample$Group, FUN = function(x) length(unique(x))) == 1, ]
or, using abs() so that positive and negative deviations from the first value cannot cancel out:
sample[ave(sample$Value, sample$Group, FUN = function(x) sum(abs(x - x[1]))) == 0, ]
or
sample[ave(sample$Value, sample$Group, FUN = function(x) diff(range(x))) == 0, ]
Here's a base R approach using aggregate:
ind <- aggregate(Value ~ Group, FUN = function(x) length(unique(x)) == 1, data = sample)[, 2]
sample[sample[, "Group"] %in% levels(sample[, "Group"])[ind], ]
ID Group Value
1 1 AA 1
2 2 AA 1
3 3 AA 1
4 4 BB 2
5 5 BB 2
8 8 BB 2
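Note that the levels() lookup assumes Group is a factor, which was the data.frame default only before R 4.0. A version-agnostic variant (my sketch, not part of the original answer) matches against the aggregated group names directly:
# aggregate() returns one row per group with a logical flag in Value
agg <- aggregate(Value ~ Group, FUN = function(x) length(unique(x)) == 1, data = sample)
sample[sample$Group %in% agg$Group[agg$Value], ]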
I would like to ask if there is a way of removing a group from a data frame using dplyr (or any other way, for that matter). Let's say I have a data frame in the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. In the table above, that would remove group 2, because it contains a, a, b, but not group 3, which contains a, c, a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
To test for consecutive identical values, you can compare each value to the previous value in its column. In dplyr, this is possible with lag. (You could do the same thing by comparing to the next value, using lead; the result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
  group_by(variable1) %>%
  mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = TRUE)) %>%
  filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
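As noted above, the helper column can be dropped afterwards with select(-dupesInGroup), or avoided entirely by filtering in one step (a minor variation on the same idea; the df construction here is a reconstruction of the question's data):
library(dplyr)
df <- data.frame(variable1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 variable2 = c("a", "b", "a", "a", "b", "a", "c", "a"))
df %>%
  group_by(variable1) %>%
  filter(sum(variable2 == lag(variable2), na.rm = TRUE) == 0)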
Prepare the data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3),
                 "Variable 2" = unlist(strsplit("abaabaca", "")))
Write functions to test whether consecutive repetitions are present:
any.consecutive.p <- function(v) {
  for (i in 1:(length(v) - 1)) {
    if (v[i] == v[i + 1]) {
      return(TRUE)
    }
  }
  return(FALSE)
}
any.consecutive.in.col.p <- function(df, col) {
  any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds the first consecutive repetition in a vector v.
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
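As an aside, the loop errors out on length-1 vectors (1:(length(v) - 1) counts down from 1 to 0), and the same test can be written as a vectorized comparison of each element with its successor (an equivalent sketch that also handles that edge case):
any.consecutive.p <- function(v) any(v[-1] == v[-length(v)])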
Split the data frame by the values of Variable.1:
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally, go over this list of data frames and test each one for consecutive duplicates in the Variable.2 column.
If any are found, don't collect that data frame.
Bind the collected data frames by rows.
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
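As a side note, do.call(rbind, ...) binds all the collected pieces in a single call rather than pairwise, which scales better than Reduce(rbind, ...) on long lists (my suggestion, not part of the original answer; rbind simply skips the NULL entries):
do.call(rbind, lapply(df.l, function(df)
  if (!any.consecutive.in.col.p(df, "Variable.2")) df))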
Say you want to remove all groups of df, grouped by a, where the column b has repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
  group_by(a) %>%
  filter(all(b != lag(b), na.rm = TRUE))
# data.table
library(data.table)
setDT(df)
df[, if (all(b != shift(b), na.rm = TRUE)) .SD, by = a]
A benchmark shows that data.table is faster:
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x) {
  df %>%
    group_by(a) %>%
    filter(all(b != lag(b), na.rm = TRUE))
}
use_DT <- function(x) {
  df[, if (all(b != shift(b), na.rm = TRUE)) .SD, a]
}
library(microbenchmark)
microbenchmark(use_dplyr(), use_DT())
I would need to expand on this question: convert data frame of counts to proportions in R
I need to calculate proportions by one condition while retaining the rest of the information in the dataset.
Reproducible example:
ID <- rep(c(1,2,3), each=3)
trial <- rep("a", 9)
variable1 <- sample(1:10, 9)
variable2 <- sample(1:10, 9)
variable3 <- sample(1:10, 9)
condition <- rep(c("i","j","k"), 3)
dat <- data.frame(ID, trial, variable1, variable2, variable3, condition)
For each variable I would like to have the proportion within each ID (i.e. each value divided by its ID's total, across the 3 conditions).
Ideally the new variables would be stored in the same data frame, e.g. as dat$variable1_p.
I know how to do this with a series of for loops, but I would like to learn how to use the apply family of functions, and to be able to extend the approach to more conditions if necessary.
We can use adply from the plyr package:
library(plyr)
adply(dat, 1, function(x)
  c('variable1_p' = x$variable1 / sum(dat[x$ID == dat$ID, ]$variable1)))
# ID trial variable1 variable2 variable3 condition variable1_p
# 1 1 a 3 5 4 i 0.20000000
# 2 1 a 8 9 9 j 0.53333333
# 3 1 a 4 4 8 k 0.26666667
# 4 2 a 7 10 5 i 0.50000000
# 5 2 a 6 8 10 j 0.42857143
# 6 2 a 1 1 7 k 0.07142857
# 7 3 a 10 6 3 i 0.47619048
# 8 3 a 9 7 6 j 0.42857143
# 9 3 a 2 3 2 k 0.09523810
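For a single variable, the same per-ID proportion can also be written in base R with ave(), which returns each group's sum aligned with the rows (a sketch assuming the numeric data from the Data block below):
dat$variable1_p <- dat$variable1 / ave(dat$variable1, dat$ID, FUN = sum)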
Another option is to use dplyr, which would handle cases where there is more than one row per condition per ID:
library(dplyr)
dat %>%
  group_by(ID, condition) %>%
  mutate(sum_v1_cond = sum(variable1)) %>%
  ungroup() %>%
  group_by(ID) %>%
  mutate(variable1_p = sum_v1_cond / sum(variable1)) %>%
  select(-sum_v1_cond)
Edit - here's a full solution for variable1, variable2, and variable3:
adply(dat, 1, function(x)
  c('variable1_p' = x$variable1 / sum(dat[x$ID == dat$ID, ]$variable1),
    'variable2_p' = x$variable2 / sum(dat[x$ID == dat$ID, ]$variable2),
    'variable3_p' = x$variable3 / sum(dat[x$ID == dat$ID, ]$variable3)))
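With more recent dplyr (>= 1.0), all three proportions can be computed in one step with across() (a sketch, again assuming the numeric columns from the Data block below):
library(dplyr)
dat %>%
  group_by(ID) %>%
  mutate(across(variable1:variable3, ~ .x / sum(.x), .names = "{.col}_p")) %>%
  ungroup()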
Data:
set.seed(123)
ID <- rep(c(1,2,3), each=3)
trial <- rep("a", 9)
variable1 <- sample(1:10, 9)
variable2 <- sample(1:10, 9)
variable3 <- sample(1:10, 9)
condition <- rep(c("i","j","k"), 3)
dat <- data.frame(ID, trial, variable1, variable2, variable3, condition,
                  stringsAsFactors = FALSE)
I have a big data set (roughly 10,000 rows) and want to create a function that counts the number of complete cases (not NAs) per group. I tried various functions (aggregate, table, sum(complete.cases), group_by, etc.), but somehow I'm missing one, probably small, trick. Thanks for any help!
A little sample data set to explain the result I need:
x <- data.frame(group = c(1:4),
                age = c(4:1, c(11, NA, 13, NA)),
                speed = c(12, NA, 15, NA))
print(x)
# group age speed
#1 1 4 12
#2 2 3 NA
#3 3 2 15
#4 4 1 NA
#5 1 11 12
#6 2 NA NA
#7 3 13 15
#8 4 NA NA
One function I wrote reads as follows:
CountPerGroup <- function(group) {
  data.set <- subset(x, group %in% group)
  vect <- vector()
  for (i in 1:length(group)) {
    vect[i] <- sum(complete.cases(data.set))
  }
  output <- data.frame(cbind(group, count = vect))
  return(output)
}
The result of
CountPerGroup(2:1)
is
group count
1 2 4
2 1 4
Unfortunately, this is wrong. Instead the outcome should look like
group count
1 2 1
2 1 4
What am I missing? How can I tell R to count complete cases per group?
Thank you very much for any help on this!
Something like this should do the trick if you wish to maintain your functionality:
x <- data.frame(group = c(1:4),
                age = c(4:1, c(11, NA, 13, NA)),
                speed = c(12, NA, 15, NA))
CountPerGroup <- function(x, groups) {
  data.set <- subset(x, group %in% groups)
  ans <- sapply(split(data.set, data.set$group),
                function(y) sum(complete.cases(y)))
  return(data.frame(group = names(ans), count = unname(ans)))
}
CountPerGroup(x, 1:2)
# group count
#1 1 2
#2 2 0
This is correct as far as I can count, but it does not agree with your suggested outcome.
EDIT
It seems that you want the number of non-NA values instead, sorted in the requested group order. Use this function instead:
CountPerGroup2 <- function(x, groups) {
  data.set <- subset(x, group %in% groups)
  ans <- sapply(split(data.set, data.set$group),
                function(y) sum(!is.na(y[, !grepl("group", names(y))])))[groups]
  return(data.frame(group = names(ans), count = unname(ans)))
}
CountPerGroup2(x, 2:1)
# group count
#1 2 1
#2 1 4
If you are just looking for a way to get the full count of non-NA values per group, you could use something like:
library(plyr)
x <- data.frame(group = c(1:4),
                age = c(4:1, c(11, NA, 13, NA)),
                speed = c(12, NA, 15, NA))
counts <- ddply(x, "group", summarize, count=sum(!is.na(c(age, speed))))
## group count
## 1 1 4
## 2 2 1
## 3 3 4
## 4 4 1
You do miss out on having a function that lets you query a subset of the groups, but you get a one-line way to calculate the full solution.
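If you do need only a subset of the groups, in a given order, you could simply index the precomputed result (a trivial follow-up, not part of the original answer):
wanted <- c(2, 1)
counts[match(wanted, counts$group), ]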
Here is a way with data.table
library(data.table)
library(functional)
countPerGroup = function(x, vec)
{
  dt = data.table(x)
  d1 = setkey(dt, group)[group %in% vec]
  d2 = d1[, lapply(.SD, Compose(Negate(is.na), sum)), by = group]
  transform(d2, count = age + speed, speed = NULL, age = NULL)
}
countPerGroup(x, 1:2)
# group count
#1: 1 4
#2: 2 1
countPerGroup(x, c(1,2))
# group count
#1: 1 4
#2: 2 1
If your data.table has a large number of rows, this is particularly efficient!
I just had the same problem and found an easier solution:
library(data.table)
x <- data.table(group = c(1:4),
                age = c(4:1, c(11, NA, 13, NA)),
                speed = c(12, NA, 15, NA))
x[, sum(complete.cases(.SD)), by = group]
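The count column comes back named V1 by default; naming it in the j expression is a small refinement (my tweak):
x[, .(count = sum(complete.cases(.SD))), by = group]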