Mean of a column only for observations meeting a condition - r

How can I add a column with the mean of z for each group of y, computed only over rows where x < 10? For groups with no such rows, the new column should simply take the value of z.
df <- data.frame(y = c(LETTERS[1:5], LETTERS[1:5],LETTERS[3:7]), x = 1:15, z = c(4:9,1:4,2:6))
y x z
1 A 1 4
2 B 2 5
3 C 3 6
4 D 4 7
5 E 5 8
6 A 6 9
7 B 7 1
8 C 8 2
9 D 9 3
10 E 10 4
11 C 11 2
12 D 12 3
13 E 13 4
14 F 14 5
15 G 15 6
I am trying something like
df %>% group_by(y) %>%
mutate(gr.mean = mean(z))
But this provides the mean for any case of x.

We can subset the 'z' with a logical condition on 'x':
library(dplyr)
df %>%
group_by(y) %>%
mutate(gr.mean = if(all(x >=10)) z else mean(z[x < 10])) %>%
ungroup
Output
# A tibble: 15 × 4
y x z gr.mean
<chr> <int> <int> <dbl>
1 A 1 4 6.5
2 B 2 5 3
3 C 3 6 4
4 D 4 7 5
5 E 5 8 8
6 A 6 9 6.5
7 B 7 1 3
8 C 8 2 4
9 D 9 3 5
10 E 10 4 8
11 C 11 2 4
12 D 12 3 5
13 E 13 4 8
14 F 14 5 5
15 G 15 6 6
Or without if/else: mean() of an empty subset returns NaN, which coalesce() replaces with z:
df %>%
group_by(y) %>%
mutate(gr.mean = coalesce(mean(z[x < 10]), z))
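For comparison, here is a minimal base R sketch of the same idea, assuming the df from the question. The helper names z_kept and cond_mean are made up for illustration; the approach masks z where x >= 10, averages per group with ave(), and falls back to z where a group has nothing left to average.

```r
df <- data.frame(y = c(LETTERS[1:5], LETTERS[1:5], LETTERS[3:7]),
                 x = 1:15, z = c(4:9, 1:4, 2:6))

# z where x < 10, NA elsewhere, so those rows drop out of the mean
z_kept <- ifelse(df$x < 10, df$z, NA)

# per-group mean over the kept rows; NaN when a group has none
cond_mean <- ave(z_kept, df$y, FUN = function(v) mean(v, na.rm = TRUE))

# fall back to z for groups without any row where x < 10
df$gr.mean <- ifelse(is.nan(cond_mean), df$z, cond_mean)
df
```

This should reproduce the gr.mean column shown in the output above without needing dplyr.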

Related

Amount of overlap of two ranges in R [DescTools?]

I need to know by how many integers two numeric ranges overlap. I tried using DescTools::Overlap, but the output is not what I expected.
library(DescTools)
library(dplyr) # provides filter() and %>%
library(tidyr)
df1 <- data.frame(ID = c('a', 'b', 'c', 'd', 'e'),
var1 = c(1, 2, 3, 4, 5),
var2 = c(9, 3, 5, 7, 11))
df1 %>% setNames(paste0(names(.), '_2')) %>% tidyr::crossing(df1) %>% filter(ID != ID_2) -> pairwise
pairwise$overlap <- DescTools::Overlap(c(pairwise$var1,pairwise$var2),c(pairwise$var1_2,pairwise$var2_2))
The output (entire column) is '10' for each row in the test dataset created above. I want the row-specific overlap for each, so the first 3 rows would be 2,3,4, respectively.
I find the easiest way to do it is using rowwise(). This operation used to be discouraged, but its performance has improved since the dplyr 1.0.0 release.
pairwise %>%
rowwise() %>%
mutate(overlap = Overlap(c(var1, var2), c(var1_2, var2_2))) %>%
ungroup()
#> # A tibble: 20 x 7
#> ID_2 var1_2 var2_2 ID var1 var2 overlap
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 a 1 9 b 2 3 1
#> 2 a 1 9 c 3 5 2
#> 3 a 1 9 d 4 7 3
#> 4 a 1 9 e 5 11 4
#> 5 b 2 3 a 1 9 1
#> 6 b 2 3 c 3 5 0
#> 7 b 2 3 d 4 7 0
#> 8 b 2 3 e 5 11 0
#> 9 c 3 5 a 1 9 2
#> 10 c 3 5 b 2 3 0
#> 11 c 3 5 d 4 7 1
#> 12 c 3 5 e 5 11 0
#> 13 d 4 7 a 1 9 3
#> 14 d 4 7 b 2 3 0
#> 15 d 4 7 c 3 5 1
#> 16 d 4 7 e 5 11 2
#> 17 e 5 11 a 1 9 4
#> 18 e 5 11 b 2 3 0
#> 19 e 5 11 c 3 5 0
#> 20 e 5 11 d 4 7 2
My version, using apply():
pairwise$overlap <- apply(pairwise, 1,
function(x) DescTools::Overlap(as.numeric(c(x[5], x[6])),
as.numeric(c(x[2],x[3]))))
pairwise
# A tibble: 20 x 7
ID_2 var1_2 var2_2 ID var1 var2 overlap
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 a 1 9 b 2 3 1
2 a 1 9 c 3 5 2
3 a 1 9 d 4 7 3
4 a 1 9 e 5 11 4
5 b 2 3 a 1 9 1
6 b 2 3 c 3 5 0
7 b 2 3 d 4 7 0
8 b 2 3 e 5 11 0
9 c 3 5 a 1 9 2
10 c 3 5 b 2 3 0
11 c 3 5 d 4 7 1
12 c 3 5 e 5 11 0
13 d 4 7 a 1 9 3
14 d 4 7 b 2 3 0
15 d 4 7 c 3 5 1
16 d 4 7 e 5 11 2
17 e 5 11 a 1 9 4
18 e 5 11 b 2 3 0
19 e 5 11 c 3 5 0
20 e 5 11 d 4 7 2
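If a package dependency is not wanted, the same numbers can probably be had with plain vectorised arithmetic. This is only a sketch, assuming every range is given as (lower, upper) with var1 <= var2; it does not call DescTools at all, and the cross join is done with merge(by = NULL) instead of tidyr::crossing.

```r
df1 <- data.frame(ID = c("a", "b", "c", "d", "e"),
                  var1 = c(1, 2, 3, 4, 5),
                  var2 = c(9, 3, 5, 7, 11))

# all ordered pairs of distinct IDs (base R Cartesian product)
pairwise <- merge(setNames(df1, paste0(names(df1), "_2")), df1, by = NULL)
pairwise <- pairwise[pairwise$ID != pairwise$ID_2, ]

# overlap length = smaller upper bound minus larger lower bound, floored at 0
pairwise$overlap <- with(pairwise,
                         pmax(0, pmin(var2, var2_2) - pmax(var1, var1_2)))
pairwise
```

Because pmin() and pmax() work element-wise, no rowwise() or apply() step is needed; the row order may differ from the tibble above.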

Only rows where difference between them is less than 'n' in groups

Let's say we have the below dataset where values in V2 are ordered ascending in groups V1:
Input =(" V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 12
6 A 13
7 B 4
8 B 5
9 B 6
10 B 12
11 C 13
12 C 14
13 C 18")
df = as.data.frame(read.table(textConnection(Input), header = T, row.names = 1))
Now I want to keep rows where the difference between consecutive ones is <= 1, so my desired output:
V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 12
6 A 13
7 B 4
8 B 5
9 B 6
11 C 13
12 C 14
However when I use:
df %>%
group_by(V1) %>%
filter(c(0,diff(V2)) <= 1)
I have:
V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 13
6 B 4
7 B 5
8 B 6
9 C 13
10 C 14
The row with V2 value 12 is missing, and it should be in the dataset. I also tried lag(), but the result is the same.
df %>%
group_by(V1) %>%
filter(V2 - lag(V2) <= 1 | is.na(V2 - lag(V2)))
Could you point out my mistake?
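You need to compare each value with both of its neighbours, the previous one and the next one. Try lag() and lead():
library(dplyr)
df %>%
group_by(V1) %>%
# keep a row if it is within 1 of the previous value or of the next value
filter(V2 - lag(V2) <= 1 | lead(V2) - V2 <= 1)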
# V1 V2
# <chr> <int>
# 1 A 3
# 2 A 4
# 3 A 5
# 4 A 6
# 5 A 12
# 6 A 13
# 7 B 4
# 8 B 5
# 9 B 6
#10 C 13
#11 C 14
Here is another idea where we create groups with a tolerance of 1, and filter out those groups with only one observation, i.e.
df %>%
group_by(V1, grp = cumsum(c(TRUE, diff(V2) > 1))) %>%
filter(n() > 1) %>%
ungroup() %>%
select(-grp)
# A tibble: 11 x 2
# V1 V2
# <fct> <int>
# 1 A 3
# 2 A 4
# 3 A 5
# 4 A 6
# 5 A 12
# 6 A 13
# 7 B 4
# 8 B 5
# 9 B 6
#10 C 13
#11 C 14
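The "check both neighbours" idea can also be sketched in base R. This is an illustration only: the helper has_close_neighbour and its argument n are hypothetical names, and V2 is assumed to be sorted ascending within each V1 group, as in the question.

```r
df <- data.frame(V1 = rep(c("A", "B", "C"), c(6, 4, 3)),
                 V2 = c(3, 4, 5, 6, 12, 13, 4, 5, 6, 12, 13, 14, 18))

# TRUE where a value is within n of its previous or next value
has_close_neighbour <- function(v, n = 1) {
  gap_prev <- c(Inf, diff(v))  # distance to the previous value
  gap_next <- c(diff(v), Inf)  # distance to the next value
  gap_prev <= n | gap_next <= n
}

# apply the check per group; ave() returns 0/1, hence as.logical()
keep <- as.logical(ave(df$V2, df$V1, FUN = has_close_neighbour))
df[keep, ]
```

Padding with Inf instead of NA means the first and last row of a group are judged on their single neighbour, so no is.na() handling is required.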

How to change the shape of a data frame in R? (stacking columns with the same names together)

I'm trying to reshape a data frame in R:
Gene_ID Value Gene_ID.1 Value.1 Gene_ID.2 Value.2
1 A 0 A 3 A 1
2 B 5 B 6 B 5
3 C 7 C 2 C 7
4 D 8 D 9 D 2
5 E 5 E 8 E 4
6 F 6 F 4 F 5
I want to make it look like this:
Gene_ID Value
1 A 0
2 B 5
3 C 7
4 D 8
5 E 5
6 F 6
7 A 1
8 B 5
9 C 7
10 D 2
11 E 4
12 F 5
13 A 3
14 B 6
15 C 2
16 D 9
17 E 8
18 F 4
So simply stack the columns with the same names together. Is there a way to do so?
Thanks!
You can use either the combination of gather()/spread() or pivot_longer() from the tidyr package.
To learn more about the new pivot_xxx() functions, check out these links:
A Graphical Introduction to tidyr's pivot_*()
Pivoting data from columns to rows (and back!) in the tidyverse
library(dplyr)
library(tidyr)
txt <- " Gene_ID.0 Value.0 Gene_ID.1 Value.1 Gene_ID.2 Value.2
1 A 0 A 3 A 1
2 B 5 B 6 B 5
3 C 7 C 2 C 7
4 D 8 D 9 D 2
5 E 5 E 8 E 4
6 F 6 F 4 F 5"
dat <- read.table(text = txt, header = TRUE)
Combine gather(), separate() and spread() functions
dat %>%
mutate(Row_Nr = row_number()) %>%
gather(key, value, -Row_Nr) %>%
separate(key, into = c("key", "Gene_Nr"), sep = "\\.") %>%
spread(key, value) %>%
select(-Row_Nr)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> Gene_Nr Gene_ID Value
#> 1 0 A 0
#> 2 1 A 3
#> 3 2 A 1
#> 4 0 B 5
#> 5 1 B 6
#> 6 2 B 5
#> 7 0 C 7
#> 8 1 C 2
#> 9 2 C 7
#> 10 0 D 8
#> 11 1 D 9
#> 12 2 D 2
#> 13 0 E 5
#> 14 1 E 8
#> 15 2 E 4
#> 16 0 F 6
#> 17 1 F 4
#> 18 2 F 5
Use pivot_longer()
### gather all values columns
### separate original column names by the period "."
### into Gene_ID/Value and Gene_Nr
dat %>%
pivot_longer(everything(),
names_to = c(".value", "Gene_Nr"),
names_pattern = "(.*)\\.(.*)")
#> Gene_Nr Gene_ID Value
#> 1 0 A 0
#> 2 1 A 3
#> 3 2 A 1
#> 4 0 B 5
#> 5 1 B 6
#> 6 2 B 5
#> 7 0 C 7
#> 8 1 C 2
#> 9 2 C 7
#> 10 0 D 8
#> 11 1 D 9
#> 12 2 D 2
#> 13 0 E 5
#> 14 1 E 8
#> 15 2 E 4
#> 16 0 F 6
#> 17 1 F 4
#> 18 2 F 5
Created on 2019-12-08 by the reprex package (v0.3.0)
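If the column pairs always sit next to each other, as in the question, a base R sketch that stacks them block by block might look like this. It assumes the original column names (Gene_ID, Value, Gene_ID.1, ...) and that columns alternate ID, value; the variable names starts and stacked are made up for the example.

```r
dat <- data.frame(Gene_ID   = LETTERS[1:6], Value   = c(0, 5, 7, 8, 5, 6),
                  Gene_ID.1 = LETTERS[1:6], Value.1 = c(3, 6, 2, 9, 8, 4),
                  Gene_ID.2 = LETTERS[1:6], Value.2 = c(1, 5, 7, 2, 4, 5))

# first column index of every (Gene_ID, Value) pair
starts <- seq(1, ncol(dat), by = 2)

# take each pair, give it common names, and stack the pieces
stacked <- do.call(rbind, lapply(starts, function(i) {
  setNames(dat[, c(i, i + 1)], c("Gene_ID", "Value"))
}))
rownames(stacked) <- NULL
stacked
```

Unlike the pivot_longer() result, the rows come out one block after another (all of the first pair, then the second, then the third), which is closer to the layout asked for in the question.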

numbering duplicated rows in dplyr [duplicate]

I ran into an issue with numbering duplicated rows in a data.frame and could not find a similar post.
Let's say we have data like this:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
> df
gr x
1 1 a
2 1 a
3 2 b
4 2 b
5 3 c
6 3 c
7 4 a
8 4 a
9 5 c
10 5 c
11 6 d
12 6 d
13 7 a
14 7 a
I want to add a new column called x_dupl that numbers each occurrence of an x value: 1 for the first occurrence, 2 for the second, 3 for the third, and so on.
thanks in advance!
The expected output
> df
gr x x_dupl
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
Your example data (plus rows where gr = 7 as in your output), and named df1, not df:
df1 <- data.frame(gr = gl(7,2),
x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
library(dplyr)
df1 %>%
group_by(x) %>%
mutate(x_dupl = dense_rank(gr)) %>%
ungroup()
# A tibble: 14 x 3
gr x x_dupl
<fctr> <fctr> <int>
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
A base R solution:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
x <- rle(as.numeric(factor(df$x))) # factor() first: x is character in R >= 4.0
x$values <- ave(x$values, x$values, FUN = seq_along)
df$x_dupl <- inverse.rle(x)
# gr x x_dupl
# 1 1 a 1
# 2 1 a 1
# 3 2 b 1
# 4 2 b 1
# 5 3 c 1
# 6 3 c 1
# 7 4 a 2
# 8 4 a 2
# 9 5 c 2
# 10 5 c 2
# 11 6 d 1
# 12 6 d 1
# 13 7 a 3
# 14 7 a 3
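Another base R variant, sketched under the assumption that gr identifies each block of repeated x values (as it does in the example data): within each x, number the distinct gr values in order of first appearance.

```r
df <- data.frame(gr = gl(7, 2),
                 x = c("a", "a", "b", "b", "c", "c", "a", "a",
                       "c", "c", "d", "d", "a", "a"))

# within each x, match() gives the position of gr among the unique gr values
df$x_dupl <- ave(as.integer(df$gr), df$x,
                 FUN = function(g) match(g, unique(g)))
df
```

This is essentially what dense_rank(gr) does per group when gr is increasing, just without the dplyr dependency.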

How to mutate multiple variables without repeating codes?

I'm trying to create new variables from existing variables like below:
a1+a2=a3, b1+b2=b3, ..., z1+z2=z3
Here is an example data frame
df <- data.frame(replicate(10,sample(1:10)))
colnames(df) <- c("a1","a2","b1","b2","c1","c2","d1","d2","e1","e2")
Here's my solution with repeated code:
# a solution by base R
df$a3 <- df$a1 + df$a2
df$b3 <- df$b1 + df$b2
df$c3 <- df$c1 + df$c2
df$d3 <- df$d1 + df$d2
df$e3 <- df$e1 + df$e2
Or
# a solution by dplyr
library(dplyr)
df <- df %>%
mutate(a3 = a1+a2,
b3 = b1+b2,
c3 = c1+c2,
d3 = d1+d2,
e3 = e1+e2)
Or
# a solution by data.table
library(data.table)
DT <- data.table(df)
DT[,a3:=a1+a2][,b3:=b1+b2][,c3:=c1+c2][,d3:=d1+d2][,e3:=e1+e2]
Actually I have more than 100 variables, so I want to find a way to do so without repeating code... Although I tried to use mutate_ with standard evaluation and regular expression, I lost my way because I'm a newbie in R. Can you mutate multiple variables without repeating code?
Your data format is making this hard, so I would reshape the data like this. In general, you shouldn't encode actual data in column names: if the difference between a1 and a2 is meaningful, it is better to have one column holding the letter (a, b, c) and another holding the number (1, 2).
df$id = 1:nrow(df)
library(tidyr)
library(dplyr)
tdf = gather(df, key = key, value = value, -id) %>%
separate(key, into = c("letter", "number"), sep = 1) %>%
mutate(number = paste0("V", number)) %>%
spread(key = number, value = value)
## now data is "tidy":
head(tdf)
# id letter V1 V2
# 1 1 a 2 7
# 2 1 b 10 4
# 3 1 c 9 10
# 4 1 d 9 4
# 5 1 e 5 8
# 6 2 a 9 8
## and the operation is simple:
tdf$V3 = tdf$V1 + tdf$V2
head(tdf)
# id letter V1 V2 V3
# 1 1 a 2 7 9
# 2 1 b 10 4 14
# 3 1 c 9 10 19
# 4 1 d 9 4 13
# 5 1 e 5 8 13
# 6 2 a 9 8 17
A possible solution using data.table:
DT <- data.table(df)[, rn := .I]
DTadd3 <- dcast(melt(DT, measure.vars = 1:10)[, `:=` (let = substr(variable,1,1), rn = 1:.N), variable
][, s3 := sum(value), .(let,rn)],
rn ~ paste0(let,3), value.var = 's3', mean)
DT[DTadd3, on = 'rn'][, rn := NULL][]
which gives:
a1 a2 b1 b2 c1 c2 d1 d2 e1 e2 a3 b3 c3 d3 e3
1: 10 5 9 5 10 4 5 3 7 10 15 14 14 8 17
2: 2 6 6 8 3 8 7 1 4 7 8 14 11 8 11
3: 6 4 7 4 4 3 4 6 3 3 10 11 7 10 6
4: 1 2 4 2 9 9 3 7 10 4 3 6 18 10 14
5: 9 10 8 1 8 7 10 5 9 1 19 9 15 15 10
6: 8 8 10 6 2 5 2 4 2 6 16 16 7 6 8
7: 7 9 1 7 5 10 9 2 1 8 16 8 15 11 9
8: 5 1 2 9 7 2 1 8 5 5 6 11 9 9 10
9: 3 7 3 3 1 6 8 10 8 9 10 6 7 18 17
10: 4 3 5 10 6 1 6 9 6 2 7 15 7 15 8
A similar solution using dplyr and tidyr:
df %>%
bind_cols(., df %>%
gather(var, val) %>%
group_by(var) %>%
mutate(let = substr(var,1,1), rn = 1:n()) %>%
group_by(let,rn) %>%
summarise(s3 = sum(val)) %>%
spread(let, s3) %>%
select(-rn)
)
However, as noted by @Gregor, it is much better to transform your data into long format. The data.table equivalent of @Gregor's answer:
DT <- data.table(df)
melt(DT[, rn := .I],
variable.name = 'let',
measure.vars = patterns('1$','2$'),
value.name = paste0('v',1:2)
)[, `:=` (let = letters[let], v3 = v1 + v2)][]
which gives (first 15 rows):
rn let v1 v2 v3
1: 1 a 10 5 15
2: 2 a 2 6 8
3: 3 a 6 4 10
4: 4 a 1 2 3
5: 5 a 9 10 19
6: 6 a 8 8 16
7: 7 a 7 9 16
8: 8 a 5 1 6
9: 9 a 3 7 10
10: 10 a 4 3 7
11: 1 b 9 5 14
12: 2 b 6 8 14
13: 3 b 7 4 11
14: 4 b 4 2 6
15: 5 b 8 1 9
My data.table solution:
library(data.table)
setDT(df) # convert df to a data.table by reference so := works
sapply(c("a", "b", "c", "d", "e"), function(ll)
df[ , paste0(ll, 3) := get(paste0(ll, 1)) + get(paste0(ll, 2))])
df[]
# a1 a2 b1 b2 c1 c2 d1 d2 e1 e2 a3 b3 c3 d3 e3
# 1: 5 2 2 6 4 1 10 7 3 9 7 8 5 17 12
# 2: 4 8 7 3 3 7 9 6 9 7 12 10 10 15 16
# 3: 10 7 6 10 1 9 4 1 2 4 17 16 10 5 6
# 4: 3 4 1 7 6 4 7 4 7 5 7 8 10 11 12
# 5: 8 3 4 2 2 2 3 3 4 10 11 6 4 6 14
# 6: 6 6 5 1 8 10 1 10 5 3 12 6 18 11 8
# 7: 2 10 8 9 5 6 2 5 10 2 12 17 11 7 12
# 8: 1 1 10 8 9 5 6 9 6 8 2 18 14 15 14
# 9: 9 5 3 5 10 3 5 2 1 6 14 8 13 7 7
# 10: 7 9 9 4 7 8 8 8 8 1 16 13 15 16 9
Or, more extensibly:
sapply(c("a", "b", "c", "d", "e"), function(ll)
df[ , paste0(ll, 3) := Reduce(`+`, mget(paste0(ll, 1:2)))])
If all of the variables fit the pattern of ending with 1 or 2, you might try:
stems = unique(gsub("[0-9]", "", names(df)))
Then sapply(stems, ...)
library(tidyverse)
reduce(.init = df, .x = letters[1:5], .f = ~ {
mutate(.x, '{.y}3' := get(str_c(.y, 1)) + get(str_c(.y, 2)))
})
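For the wide layout, a short base R sketch is also possible, assuming every stem has exactly one column ending in 1 and one ending in 2. The variable name stems follows the suggestion above, and set.seed() is added only to make the example reproducible.

```r
set.seed(1)  # only for a reproducible example
df <- data.frame(replicate(10, sample(1:10)))
colnames(df) <- c("a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2", "e1", "e2")

# column-name stems: "a", "b", "c", ...
stems <- unique(sub("[0-9]+$", "", names(df)))

# add the block of *1 columns to the block of *2 columns, position by position
df[paste0(stems, 3)] <- df[paste0(stems, 1)] + df[paste0(stems, 2)]
head(df)
```

Adding two data frames of the same shape works element-wise, so all the sums are computed in one step however many stems there are.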
