How to divide between groups of rows using dplyr? - r

I have this dataframe:
x <- data.frame(
name = rep(letters[1:4], each = 2),
condition = rep(c("A", "B"), times = 4),
value = c(2,10,4,20,8,40,20,100)
)
# name condition value
# 1 a A 2
# 2 a B 10
# 3 b A 4
# 4 b B 20
# 5 c A 8
# 6 c B 40
# 7 d A 20
# 8 d B 100
I want to group by name and divide the value of rows with condition == "B" with those with condition == "A", to get this:
data.frame(
name = letters[1:4],
value = c(5,5,5,5)
)
# name value
# 1 a 5
# 2 b 5
# 3 c 5
# 4 d 5
I know something like this can get me pretty close:
x$value[which(x$condition == "B")]/x$value[which(x$condition == "A")]
but I was wondering if there was an easy way to do this with dplyr (My dataframe is a toy example and I got to it by chaining multiple group_by and summarise calls).

Try:
x %>%
group_by(name) %>%
summarise(value = value[condition == "B"] / value[condition == "A"])
Which gives:
#Source: local data frame [4 x 2]
#
# name value
# (fctr) (dbl)
#1 a 5
#2 b 5
#3 c 5
#4 d 5

I'd use spread from tidyr.
library(dplyr)
library(tidyr)
x %>%
spread(condition, value) %>%
mutate(value = B/A)
name A B value
1 a 2 10 5
2 b 4 20 5
3 c 8 40 5
4 d 20 100 5
You could then do select(-A, -B) to drop the extra columns.

Using data.table, convert the 'data.frame' to 'data.table' (setDT(x)), grouped by 'name', we divide the 'value' corresponds to 'B' condition by the those that corresponds to 'A' 'condition'.
library(data.table)
setDT(x)[,.(value = value[condition=="B"]/value[condition=="A"]) , name]
# name value
#1: a 5
#2: b 5
#3: c 5
#4: d 5
Or reshape from 'long' to 'wide' and divide the 'B' column by 'A'.
dcast(setDT(x), name~condition, value.var='value')[, .(name, value = B/A)]

Related

How can I remove rows with the same value in 2 ore more rows in R

I have a dataframe in the following format with ID's and A/B's. The dataframe is very long, over 3000 ID's.
id
type
1
A
2
B
3
A
4
A
5
B
6
A
7
B
8
A
9
B
10
A
11
A
12
A
13
B
...
...
I need to remove all rows (A+B), where more than one A is behind another one or more. So I dont want to remove the duplicates. If there are a duplicate (2 or more A's), i want to remove all A's and the B until the next A.
id
type
1
A
2
B
6
A
7
B
8
A
9
B
...
...
Do I need a loop for this problem? I hope for any help,thank you!
This might be what you want:
First, define a function that notes the indices of what you want to remove:
row_sequence <- function(value) {
inds <- which(value == lead(value))
sort(unique(c(inds, inds + 1, inds +2)))
}
Apply the function to your dataframe by first extracting the rows that you want to remove into df1 and second anti_joining df1 with df to obtain the final dataframe:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>%
anti_join(., df1)
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
id = 1:13,
type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
I imagined there is only one B after a series of duplicated A values, however if that is not the case just let me know to modify my codes:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rles = data.table::rleid(type)) %>%
group_by(rles) %>%
mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
ungroup() %>%
mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
drop_na() %>%
select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")

Update column of dataframe1 based on column of dataframe2 + create new row if column1 is not empty

I have a dataframe that I want to update with information from another dataframe, a lookup dataframe.
In particular, I'd like to update the cells of df1$value with the cells of df2$value based on the columns id and id2.
If the cell of df1$value is NA, I know how to do it using the package data.table
BUT
If the cell of df1$value is not empty, data.table will update it with the cell of df2$value anyway.
I don't want that. I'd like to have that:
IF the cell of df1$value is NOT empty (in this case the row in which df1$id is c), do not update the cell but create a duplicate row of df1 in which the cell of df1$value takes the value from the cell of df2$value
I already looked for solutions online but I couldn't find any. Is there a way to do it easily with tidyverse or data.table or an sql-like package?
Thank you for your help!
edit: I've just realized that I forgot to put the corner case in which in both dataframes the row is NA. With the replies I had so far (07/08/19 14:42) the row e is removed from the last dataframe. But I really need to keep it!
Outline:
> df1
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 d 4 NA
5 e 5 NA
> df2
id id2 value
1 c 3 200
2 d 4 201
3 e 5 NA
# I'd like:
> df5
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 c 3 200
5 d 4 201
6 e 5 NA
This is how I managed to solve my problem but it's quite cumbersome.
# I create the dataframes
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))
# I first do a left_join so I'll have two value columnes: value.x and value.y
df3 <- dplyr::left_join(df1, df2, by = c("id","id2"))
# > df3
# id id2 value.x value.y
# 1 a 1 100 NA
# 2 b 2 101 NA
# 3 c 3 50 200
# 4 d 4 NA 201
# I keep only the rows in which value.x is NA, so the 4th row
df4 <- df3 %>%
filter(is.na(value.x)) %>%
dplyr::select(id, id2, value.y)
# > df4
# id id2 value.y
# 1 d 4 201
# I rename the column "value.y" to "value". (I don't do it with dplyr because the function dplyr::replace doesn't work in my R version)
colnames(df4)[colnames(df4) == "value.y"] <- "value"
# > df4
# id id2 value
# 1 d 4 201
# I update the df1 with the df4$value. This step is necessary to update only the rows of df1 in which df1$value is NA
setDT(df1)[setDT(df4), on = c("id","id2"), `:=`(value = i.value)]
# > df1
# id id2 value
# 1: a 1 100
# 2: b 2 101
# 3: c 3 50
# 4: d 4 201
# I filter only the rows in which both value.x and value.y are NAs
df3 <- as_tibble(df3) %>%
filter(!is.na(value.x), !is.na(value.y)) %>%
dplyr::select(id, id2, value.y)
# > df3
# # A tibble: 1 x 3
# id id2 value.y
# <chr> <dbl> <dbl>
# 1 c 3 200
# I rename column df3$value.y to value
colnames(df3)[colnames(df3) == "value.y"] <- "value"
# I bind by rows df1 and df3 and I order by the column id
df5 <- rbind(df1, df3) %>%
arrange(id)
# > df5
# id id2 value
# 1 a 1 100
# 2 b 2 101
# 3 c 3 50
# 4 c 3 200
# 5 d 4 201
A left join with data.table:
library(data.table)
setDT(df1); setDT(df2)
df2[df1, on=.(id, id2), .(value =
if (.N == 0) i.value
else na.omit(c(i.value, x.value))
), by=.EACHI]
id id2 value
1: a 1 100
2: b 2 101
3: c 3 50
4: c 3 200
5: d 4 201
How it works: The syntax is x[i, on=, j, by=.EACHI]: for each row of i = df1 do j.
In this case j = .(value = expr) where .() is a shortcut to list() since in general j should return a list of columns.
Regarding the expression, .N is the number of rows of x = df2 that are found for each row of i = df1, so if no matches are found we keep values from i; and otherwise we keep values from both tables, dropping missing values.
A dplyr way:
bind_rows(df1, semi_join(df2, df1, by=c("id", "id2"))) %>%
group_by(id, id2) %>%
do(if (nrow(.) == 1) . else na.omit(.))
# A tibble: 5 x 3
# Groups: id, id2 [4]
id id2 value
<chr> <dbl> <dbl>
1 a 1 100
2 b 2 101
3 c 3 50
4 c 3 200
5 d 4 201
Comment. The dplyr way is kind of awkward because do() is needed to get a dynamically determined number of rows, but do() is typically discouraged and does not support n() and other helper functions. The data.table way is kind of awkward because there is no simple semi join functionality.
Data:
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))
> df1
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 d 4 NA
> df2
id id2 value
1 c 3 200
2 d 4 201
3 e 5 300
Another idea via base R is to remove the rows from df2 that do not match in df1, bind the two data frames rowwise (rbind) and omit the NAs, i.e.
na.omit(rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),]))
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#5 c 3 200
#6 d 4 201
To answer your new requirements, we can keep the same rbind method and filter based on your conditions, i.e.
dd <- rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),])
dd[!!with(dd, ave(value, id, id2, FUN = function(i)(all(is.na(i)) & !duplicated(i)) | !is.na(i))),]
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#5 e 5 NA
#6 c 3 200
#7 d 4 201
A possible approach with data.table using update join then full outer merge:
merge(df1[is.na(value), value := df2[.SD, on=.(id, id2), x.value]], df2, all=TRUE)
output:
id id2 value
1: a 1 100
2: b 2 101
3: c 3 50
4: c 3 200
5: d 4 201
6: e 5 NA
data:
library(data.table)
df1 <- data.table(id=c('a', 'b', 'c', 'd', 'e'), id2=c(1,2,3,4,5),value=c(100, 101, 50, NA, NA))
df2 <- data.table(id=c('c', 'd', 'e'), id2=c(3,4, 5), value=c(200, 201, NA))
Here is one way using full_join and gather
library(dplyr)
left_join(df1, df2, by = c("id","id2")) %>%
tidyr::gather(key, value, starts_with("value"), na.rm = TRUE) %>%
select(-key)
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#7 c 3 200
#8 d 4 201
For the updated case, we can do
left_join(df1, df2, by = c("id","id2")) %>%
tidyr::gather(key, value, starts_with("value")) %>%
group_by(id, id2) %>%
filter((all(is.na(value)) & !duplicated(value)) | !is.na(value)) %>%
select(-key)
# id id2 value
# <chr> <int> <int>
#1 a 1 100
#2 b 2 101
#3 c 3 50
#4 e 5 NA
#5 c 3 200
#6 d 4 201

Find time-lag between groups in a data.frame

Let's suppose I want to estimate the time lag between two groups within a data.frame.
Here an example of my data:
df_1 = data.frame(time = c(1,3,5,6,8,11,15,16,18,20), group = 'a') # create group 'a' data
df_2 = data.frame(time = c(2,7,10,13,19,25), group = 'b') # create group 'b' data
df = rbind(df_1, df_2) # merge groups
df = df[with(df, order(time)), ] # order by time
rownames(df) = NULL #remove row names
> df
time group
1 1 a
2 2 b
3 3 a
4 5 a
5 6 a
6 7 b
7 8 a
8 10 b
9 11 a
10 13 b
11 15 a
12 16 a
13 18 a
14 19 b
15 20 a
16 25 b
Now I need to subtract the time observation from group b to the time observation from group a.
i.e. 2-1, 7-6, 10-8, 13-11, 19-18 and 25-20.
# Expected output
> out
[1] 1 1 2 2 1 5
How can I achieve this?
We can find indices of b and subtract the time value from it's previous index.
inds <- which(df$group == "b")
df$time[inds] - df$time[inds - 1]
#[1] 1 1 2 2 1 5
Here's a tidyverse solution. First add a column by basic logic of the appearance of group b with transmute and a subtraction of the preceding column. Then filter to just the results, and convert to vector with deframe
library(tidyverse)
df %>%
transmute(result = if_else(group == "b", time - lag(time), 0)) %>%
filter(result != 0) %>%
deframe()
result:
[1] 1 1 2 2 1 5

Realization of a data frame with the smallest values grouped by a column

How can I create a new data frame with the smallest values group by a column.
For example this df:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 0
B 6
C 1
D 0
D 4')
Now with:
test <- setDT(df)[, .SD[which.min(Value)], by=Gene]
I get this:
> test
Gene Value
1: A 10
2: B 0
3: C 1
4: D 0
But how can I use a second condition for Value > 0 here? I want to have this output:
> test
Gene Value
1: A 10
2: B 3
3: C 1
4: D 4
Could do:
setDT(df)[, .(Value = min(Value[Value > 0])), by=Gene]
Output:
Gene Value
1: A 10
2: B 3
3: C 1
4: D 4
Using tidyverse you can group, filter and then summarize the min value:
library(tidyverse)
df2 <- df %>%
group_by(Gene) %>%
filter(Value != 0) %>%
summarise(Value = min(Value))
# A tibble: 4 x 2
Gene Value
<fct> <dbl>
1 A 10
2 B 3
3 C 1
4 D 4
Using aggregate from base R
aggregate(Value ~ Gene, subset(df, Value > 0), min)
# Gene Value
#1 A 10
#2 B 3
#3 C 1
#4 D 4

In R, split a dataframe so subset dataframes contain last row of previous dataframe and first row of subsequent dataframe

There are many answers for how to split a dataframe, for example How to split a data frame?
However, I'd like to split a dataframe so that the smaller dataframes contain the last row of the previous dataframe and the first row of the following dataframe.
Here's an example
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
data.frame(n = n, group)
n group
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
8 8 c
9 9 c
I'd like the output to look like:
d1 <- data.frame(n = 1:4, group = c(rep("a",3),"b"))
d2 <- data.frame(n = 3:7, group = c("a",rep("b",3),"c"))
d3 <- data.frame(n = 6:9, group = c("b",rep("c",3)))
d <- list(d1, d2, d3)
d
[[1]]
n group
1 1 a
2 2 a
3 3 a
4 4 b
[[2]]
n group
1 3 a
2 4 b
3 5 b
4 6 b
5 7 c
[[3]]
n group
1 6 b
2 7 c
3 8 c
4 9 c
What is an efficient way to accomplish this task?
Suppose DF is the original data.frame, the one with columns n and group. Let n be the number of rows in DF. Now define a function extract which given a sequence of indexes ix enlarges it to include the one prior to the first and after the last and then returns those rows of DF. Now that we have defined extract, split the vector 1, ..., n by group and apply extract to each component of the split.
n <- nrow(DF)
extract <- function(ix) DF[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
lapply(split(seq_len(n), DF$group), extract)
$a
n group
1 1 a
2 2 a
3 3 a
4 4 b
$b
n group
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
$c
n group
6 6 b
7 7 c
8 8 c
9 9 c
Or why not try good'ol by, which "[a]ppl[ies] a Function to a Data Frame Split by Factors [INDICES]".
by(data = df, INDICES = df$group, function(x){
id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
na.omit(df[id, ])
})
# df$group: a
# n group
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# --------------------------------------------------------------------------------
# df$group: b
# n group
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# --------------------------------------------------------------------------------
# df$group: c
# n group
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
Although the print method of by creates a 'fancy' output, the (default) result is a list, with elements named by the levels of the grouping variable (just try str and names on the resulting object).
I was going to comment under #cdetermans answer but its too late now.
You can generalize his approach using data.table::shift (or dyplr::lag) in order to find the group indices and then run a simple lapply on the ranges, something like
library(data.table) # v1.9.6+
indx <- setDT(df)[, which(group != shift(group, fill = TRUE))]
lapply(Map(`:`, c(1L, indx - 1L), c(indx, nrow(df))), function(x) df[x,])
# [[1]]
# n group
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 4 b
#
# [[2]]
# n group
# 1: 3 a
# 2: 4 b
# 3: 5 b
# 4: 6 b
# 5: 7 c
#
# [[3]]
# n group
# 1: 6 b
# 2: 7 c
# 3: 8 c
# 4: 9 c
Could be done with data.frame as well, but is there ever a reason not to use data.table? Also this has the option to be executed with parallelism.
library(data.table)
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
df <- data.table(n = n, group)
df[, `:=` (group = factor(df$group))]
df[, `:=` (group_i = seq_len(.N), group_N = .N), by = "group"]
library(doParallel)
groups <- unique(df$group)
foreach(i = seq(groups)) %do% {
df[group == groups[i] | (as.integer(group) == i + 1 & group_i == 1) | (as.integer(group) == i - 1 & group_i == group_N), c("n", "group"), with = FALSE]
}
[[1]]
n group
1: 1 a
2: 2 a
3: 3 a
4: 4 b
[[2]]
n group
1: 3 a
2: 4 b
3: 5 b
4: 6 b
5: 7 c
[[3]]
n group
1: 6 b
2: 7 c
3: 8 c
4: 9 c
Here is another dplyr way:
library(dplyr)
data =
data_frame(n = n, group) %>%
group_by(group)
firsts =
data %>%
slice(1) %>%
ungroup %>%
mutate(new_group = lag(group)) %>%
slice(-1)
lasts =
data %>%
slice(n()) %>%
ungroup %>%
mutate(new_group = lead(group)) %>%
slice(-n())
bind_rows(firsts, data, lasts) %>%
mutate(final_group =
ifelse(is.na(new_group),
group,
new_group) ) %>%
arrange(final_group, n) %>%
group_by(final_group)

Resources