Subset by two factor variables [duplicate] - r

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(8 answers)
Closed 4 years ago.
I'd like to aggregate my dataset considering the interactions between two factors (fac1, fac2) and apply a function to each combination. For example, consider the dataset given by
set.seed(1)
test <- data.frame(fac1 = sample(c("A", "B", "C"), 30, rep = T),
                   fac2 = sample(c("a", "b"), 30, rep = T),
                   value = runif(30))
For fac1 == "A" and "fac2 == a" we have five values and I'd like to aggregate by min. Using brutal force I tried this way
min(test[test$fac1 == "A" & test$fac2 == "a", ]$value)

You mention aggregate and that will work here.
aggregate(test$value, test[,1:2], min)
fac1 fac2 x
1 A a 0.32535215
2 B a 0.14330438
3 C a 0.33239467
4 A b 0.33907294
5 B b 0.08424691
6 C b 0.24548851
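For reference, the same result can also be obtained with aggregate()'s formula interface, which some find easier to read; a small sketch:
# minimum of value for every fac1/fac2 combination
aggregate(value ~ fac1 + fac2, data = test, FUN = min)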

Here is a tidyverse alternative:
library(dplyr)
test %>% group_by(fac1, fac2) %>% summarise(x = min(value))
## A tibble: 6 x 3
## Groups: fac1 [?]
# fac1 fac2 x
# <fct> <fct> <dbl>
#1 A a 0.325
#2 A b 0.339
#3 B a 0.143
#4 B b 0.0842
#5 C a 0.332
#6 C b 0.245
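If a cross-tabulated layout is more convenient, base tapply() over both factors returns the minima as a fac1-by-fac2 matrix; a small sketch:
# rows are fac1 levels, columns are fac2 levels
with(test, tapply(value, list(fac1, fac2), min))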

Related

Using R & dplyr to summarize - group_by, count, mean, sd [closed]

I am fairly new to R and even newer to dplyr. I have a small data set with 2 columns - var1 and var2. The var1 column contains numeric values. The var2 column is a factor with 3 levels - A, B, and C.
var1 var2
1 1.4395244 A
2 1.7698225 A
3 3.5587083 A
4 2.0705084 A
5 2.1292877 A
6 3.7150650 B
7 2.4609162 B
8 0.7349388 B
9 1.3131471 B
10 1.5543380 B
11 3.2240818 C
12 2.3598138 C
13 2.4007715 C
14 2.1106827 C
15 1.4441589 C
'data.frame': 15 obs. of 2 variables:
$ var1: num 1.44 1.77 3.56 2.07 2.13 ...
$ var2: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 2 2 2 2 2 ...
I am trying to use dplyr to group_by var2 (A, B, and C), then count, and summarize var1 by mean and sd. The count works, but rather than providing the mean and sd for each group, I receive the overall mean and sd next to each group.
To try to resolve the issue, I have conducted multiple internet searches. All results seem to offer a similar syntax to the one I am using. I have also read through all of the recommended posts that Stack Overflow offered prior to posting. Also, I tried restarting R and I made sure that I am not using plyr.
Here is the code that I used to create the data set and the dplyr group_by / summarize.
library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean=2, sd=1)
var2 <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C")
df <- data.frame(var1, var2)
df
df %>%
  group_by(df$var2) %>%
  summarize(
    count = n(),
    mean = mean(df$var1, na.rm = TRUE),
    sd = sd(df$var1, na.rm = TRUE)
  )
Here are the results:
# A tibble: 3 x 4
`df$var2` count mean sd
<fct> <int> <dbl> <dbl>
1 A 5 2.15 0.845
2 B 5 2.15 0.845
3 C 5 2.15 0.845
The count appears to work, showing a count of 5 for each group, but each group shows the overall mean and sd of the whole column rather than its own. The expected results are the count, mean, and sd for each group.
I am sure I am overlooking something obvious but I would greatly appreciate any assistance.
Even though answered via comments, I felt such a nice reproducible example for a very first question deserved an official answer. The problem is that df$var1 and df$var2 refer to the full columns of the original data frame, which bypasses the grouping; inside group_by() and summarize(), use the bare column names instead.
library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean=2, sd=1)
var2 <- c(rep("A", 5), rep("B", 5), rep("C", 5))
df <- data.frame(var1, var2)
df_stat <- df %>%
  group_by(var2) %>%
  summarize(
    count = n(),
    mean = mean(var1, na.rm = TRUE),
    sd = sd(var1, na.rm = TRUE))
head(df_stat)
# A tibble: 3 x 4
# var2 count mean sd
# <fct> <int> <dbl> <dbl>
# 1 A 5 2.19 0.811
# 2 B 5 1.96 1.16
# 3 C 5 2.31 0.639
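For comparison, the same grouped summary can also be produced without dplyr; a minimal base R sketch using aggregate() (the three statistics land in a single matrix column named var1):
aggregate(var1 ~ var2, data = df,
          FUN = function(x) c(count = length(x), mean = mean(x), sd = sd(x)))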

Rank a dataframe based on multiple conditions [duplicate]

Suppose I have the following data
df = data.frame(name=c("A", "B", "C", "D"), score = c(10, 10, 9, 8))
I want to add a new column with the ranking. This is what I'm doing:
library(dplyr)
df %>% mutate(ranking = rank(score, ties.method = 'first'))
# name score ranking
# 1 A 10 3
# 2 B 10 4
# 3 C 9 2
# 4 D 8 1
However, my desired result is:
# name score ranking
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3
Clearly rank does not do what I have in mind. What function should I be using?
It sounds like you're looking for dense_rank from "dplyr", but applied in reverse order compared to what rank normally does.
Try this:
df %>% mutate(rank = dense_rank(desc(score)))
# name score rank
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3
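For comparison, base rank() on the negated score gives "competition" style ranks, where ties share a rank but the next rank is skipped; none of its ties.method options produces dense ranks, which is why dense_rank() is needed here. A quick sketch:
# returns 1 1 3 4 for scores 10 10 9 8
rank(-df$score, ties.method = "min")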
Another solution, for when you need to apply the rank to all variables (not just one):
df = data.frame(name = c("A", "B", "C", "D"),
                score = c(10, 10, 9, 8), score2 = c(5, 1, 9, 2))
select(df, -name) %>% mutate_all(funs(dense_rank(desc(.))))
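Note that funs() has since been deprecated; with dplyr 1.0.0 or later the same idea can be written with across(), a sketch:
library(dplyr)
select(df, -name) %>% mutate(across(everything(), ~ dense_rank(desc(.x))))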
@user101089, you can try this alternative way:
df = data.frame(name = c("A", "B", "C", "D"),
                score = c(10, 10, 9, 8), score2 = c(5, 1, 9, 2))
df %>% mutate(rank_score = dense_rank(desc(score)),
              rank_score2 = dense_rank(desc(score2)))

For each observation, find a corresponding centile on a subset determined by factor

Assume I have a data frame like so:
df<-data.frame(f=rep(c("a", "b", "c", "d"), 100), value=rnorm(400))
I want to create a new column, which will contain a centile that an observation belongs to, calculated separately on each factor level.
What would be a reasonably simple and efficient way to do that? The closest I came to a solution was
df$newColumn<-findInterval(df$value, tapply(df$value, df$f, quantile, probs=seq(0, 0.99, 0.01))$df[, "f"])
However, this just gives zeros to all observations. The tapply returns a four-element list of quantile vectors and I'm not sure how to access a relevant element for each observation to pass as an argument for the findInterval function.
The number of rows in the data frame could reach a few millions, so speed is an issue too. The factor column will always have four levels.
With dplyr:
library(dplyr)
df %>%
  group_by(f) %>%
  mutate(quant = findInterval(value, quantile(value)))
#> Source: local data frame [400 x 3]
#> Groups: f [4]
#>
#> f value quant
#> <fctr> <dbl> <int>
#> 1 a 0.51184061 3
#> 2 b 0.44362348 3
#> 3 c -1.04869448 1
#> 4 d -2.41772425 1
#> 5 a 0.10738332 3
#> 6 b -0.58630348 1
#> 7 c 0.34376820 3
#> 8 d 0.68322738 4
#> 9 a 1.00232314 4
#> 10 b 0.05499391 3
#> # ... with 390 more rows
With data.table:
library(data.table)
dt <- setDT(df)
dt[, quant := findInterval(value, quantile(value)), by = f]
dt
#> f value quant
#> 1: a 0.3608395 3
#> 2: b -0.1028948 2
#> 3: c -2.1903336 1
#> 4: d 0.7470262 4
#> 5: a 0.5292031 3
#> ---
#> 396: d -1.3475332 1
#> 397: a 0.1598605 3
#> 398: b -0.4261003 2
#> 399: c 0.3951650 3
#> 400: d -1.4409000 1
Data:
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))
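Note that quantile(value) with no probs argument returns only the five quartile breaks, so quant above is really a quartile bin rather than a centile. For actual centiles the probabilities have to be spelled out; a sketch of the dplyr version:
df %>%
  group_by(f) %>%
  mutate(centile = findInterval(value, quantile(value, probs = seq(0, 0.99, 0.01))))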
I think that data.table is faster, however, a solution without using packages is:
Define a function based on cut or findInterval together with quantile
cut2 <- function(x){
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.01)), include.lowest = TRUE, labels = 1:100)
}
then, apply it by factor using ave
df$newColumn <- ave(df$value, df$f, FUN = cut2)
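If a plain integer centile (1 to 100) is preferred over the factor that cut() returns, findInterval() can be used inside ave() directly; a sketch (centile_of is just an illustrative helper name):
# per-group centile as an integer from 1 to 100, no packages needed
centile_of <- function(x) findInterval(x, quantile(x, probs = seq(0, 0.99, 0.01)))
df$newColumn <- ave(df$value, df$f, FUN = centile_of)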

Grouping of R dataframe by connected values

I didn't find a solution for this common grouping problem in R:
This is my original dataset
ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C
This should be my grouped resulting dataset
State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID values should be returned. It's related to the rle method, but rle doesn't allow calculating min and max values for the groups.
Any ideas?
You could try:
library(dplyr)
df %>%
  mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
  group_by(rleid) %>%
  summarise(State = first(State), min = min(ID), max = max(ID)) %>%
  select(-rleid)
Or as per mentioned by #alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:
df %>%
  group_by(State, rleid = data.table::rleid(State)) %>%
  summarise_all(funs(min, max)) %>%
  select(-rleid)
Which gives:
## A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
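Since funs() has since been deprecated, the grouped summary in the second pipeline can be written explicitly with current dplyr (1.0.0 or later); a sketch:
df %>%
  group_by(rleid = data.table::rleid(State), State) %>%
  summarise(min = min(ID), max = max(ID), .groups = "drop") %>%
  select(-rleid)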
Here is a method that uses the rle function in base R for the data set you provided.
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State = temp$values,
                    min.ID = c(1, head(cumsum(temp$lengths) + 1, -1)),
                    max.ID = cumsum(temp$lengths))
which returns
newDF
State min.ID max.ID
1 A 1 2
2 B 3 5
3 A 6 8
4 C 9 10
Note that rle requires a character vector rather than a factor, so I use the as.is argument below.
As #cryo111 notes in the comments below, the data set might be unordered timestamps that do not correspond to the lengths calculated in rle. For this method to work, you would need to first convert the timestamps to a date-time format, with a function like as.POSIXct, use df <- df[order(df$ID),], and then employ a slight alteration of the method above:
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State = temp$values,
                    min.ID = df$ID[c(1, head(cumsum(temp$lengths) + 1, -1))],
                    max.ID = df$ID[cumsum(temp$lengths)])
data
df <- read.table(header=TRUE, as.is=TRUE, text="ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
An idea with data.table:
require(data.table)
dt <- fread("ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
dt[,rle := rleid(State)]
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")]
which gives:
rle State min max
1: 1 A 1 2
2: 2 B 3 5
3: 3 A 6 8
4: 4 C 9 10
The idea is to identify sequences with rleid and then get the min and max of ID by the tuple (rle, State).
You can remove the rle column with
dt2[,rle:=NULL]
Chained:
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")][,rle:=NULL]
You can shorten the above code even more by using rleid inside by directly:
dt2 <- dt[, .(min=min(ID),max=max(ID)), by=.(State, rleid(State))][, rleid:=NULL]
Here is another attempt using rle and aggregate from base R:
rl <- rle(df$State)
newdf <- data.frame(ID=df$ID, State=rep(1:length(rl$lengths),rl$lengths))
newdf <- aggregate(ID~State, newdf, FUN = function(x) c(minID=min(x), maxID=max(x)))
newdf$State <- rl$values
# State ID.minID ID.maxID
# 1 A 1 2
# 2 B 3 5
# 3 A 6 8
# 4 C 9 10
data
df <- structure(list(ID = 1:10, State = c("A", "A", "B", "B", "B",
"A", "A", "A", "C", "C")), .Names = c("ID", "State"), class = "data.frame",
row.names = c(NA,
-10L))
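Note that aggregate() with a FUN returning a vector stores both summaries in a single matrix column (ID here); if ordinary flat columns are preferred, the frame can be flattened afterwards, a sketch:
# expands the matrix column into ID.minID and ID.maxID
newdf <- do.call(data.frame, newdf)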

Sum of two Columns of Data Frame with NA Values

I have a data frame with some NA values. I need the sum of two of the columns. If a value is NA, I need to treat it as zero.
a b c d
1 2 3 4
5 NA 7 8
Column e should be the sum of b and c:
e
5
7
I have tried a lot of things, and done two dozen searches with no luck. It seems like a simple problem. Any help would be appreciated!
dat$e <- rowSums(dat[,c("b", "c")], na.rm=TRUE)
dat
# a b c d e
# 1 1 2 3 4 5
# 2 5 NA 7 8 7
A dplyr solution:
library(dplyr)
dat %>%
  rowwise() %>%
  mutate(e = sum(b, c, na.rm = TRUE))
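A vectorised dplyr alternative that avoids rowwise() is to turn the NAs into zeros first with coalesce(); a sketch:
library(dplyr)
dat %>% mutate(e = coalesce(b, 0) + coalesce(c, 0))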
Here is another solution, with nested ifelse():
dat$e <- ifelse(is.na(dat$b) & is.na(dat$c), 0, ifelse(is.na(dat$b), 0 + dat$c, dat$b + dat$c))
# a b c d e
#1 1 2 3 4 5
#2 5 NA 7 8 7
Edit: here is another solution that uses with(), as suggested by @kasterma in the comments; this is much more readable and straightforward:
dat$e <- with(dat, ifelse(is.na(b) & is.na(c ), 0, ifelse(is.na(b), 0 + c, b + c)))
If you want to keep NA when both columns have it, you can use:
Data, sample:
library(data.table)
dt <- data.table(x = sample(c(NA, 1, 2, 3), 100, replace = T), y = sample(c(NA, 1, 2, 3), 100, replace = T))
Solution:
dt[, z := ifelse(is.na(x) & is.na(y), NA_real_, rowSums(.SD, na.rm = T)), .SDcols = c("x", "y")]
(the data.table way)
I hope that it may help you
In some cases you have a few columns that are not numeric; this approach handles that as well.
Note that c_across() requires dplyr version 1.0.0 or later.
df <- data.frame(
  TEXT = c("text1", "text2"), a = c(1, 5), b = c(2, NA), c = c(3, 7), d = c(4, 8))
df2 <- df %>%
  rowwise() %>%
  mutate(e = sum(c_across(a:d), na.rm = TRUE))
# A tibble: 2 x 6
# Rowwise:
# TEXT a b c d e
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 text1 1 2 3 4 10
# 2 text2 5 NA 7 8 20
