I have a data frame of orders and receivables with lead times.
Can I use dplyr to fill in the receive column according to each group's lead time?
df <- data.frame(team = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
                 order = c(2, 4, 3, 5, 6, 7, 8, 5, 4, 5),
                 lead_time = c(3, 3, 3, 3, 3, 2, 2, 2, 2, 2))
> df
team order lead_time
a 2 3
a 4 3
a 3 3
a 5 3
a 6 3
b 7 2
b 8 2
b 5 2
b 4 2
b 5 2
I want to add a receive column like so:
dfb <- data.frame(team = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
                  order = c(2, 4, 3, 5, 6, 7, 8, 5, 4, 5),
                  lead_time = c(3, 3, 3, 3, 3, 2, 2, 2, 2, 2),
                  receive = c(0, 0, 0, 2, 4, 0, 0, 7, 8, 5))
> dfb
team order lead_time receive
a 2 3 0
a 4 3 0
a 3 3 0
a 5 3 2
a 6 3 4
b 7 2 0
b 8 2 0
b 5 2 7
b 4 2 8
b 5 2 5
I was thinking along these lines but ran into an error:
dfc <- df %>%
  group_by(team) %>%
  mutate(receive = if_else(row_number() < lead_time, 0, lag(order, n = lead_time)))
Error in mutate_impl(.data, dots) :
could not convert second argument to an integer. type=SYMSXP, length = 1
Thanks for the help!
This looks like a bug; there might be some unintended masking of the lag function between the dplyr and stats packages. Try this workaround, which also collapses the per-group lead_time column into the single value that the n argument expects:
df %>%
  group_by(team) %>%
  # explicitly specify the source of the lag function here
  mutate(receive = dplyr::lag(order, n = unique(lead_time), default = 0))
#Source: local data frame [10 x 4]
#Groups: team [2]
# team order lead_time receive
# <fctr> <dbl> <dbl> <dbl>
#1 a 2 3 0
#2 a 4 3 0
#3 a 3 3 0
#4 a 5 3 2
#5 a 6 3 4
#6 b 7 2 0
#7 b 8 2 0
#8 b 5 2 7
#9 b 4 2 8
#10 b 5 2 5
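Since lead_time is constant within each team, indexing its first element works just as well as unique(); a minimal variant of the same idea:

df %>%
  group_by(team) %>%
  mutate(receive = dplyr::lag(order, n = lead_time[1], default = 0))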
We can also use shift from data.table, taking each group's first lead_time value as the shift amount:
library(data.table)
setDT(df)[, receive := shift(order, n = lead_time[1], fill = 0), by = team]
df
# team order lead_time receive
# 1: a 2 3 0
# 2: a 4 3 0
# 3: a 3 3 0
# 4: a 5 3 2
# 5: a 6 3 4
# 6: b 7 2 0
# 7: b 8 2 0
# 8: b 5 2 7
# 9: b 4 2 8
#10: b 5 2 5
Related
I have 5 data frames. I want to recode all variables ending with "_comfort", "_agree", and "_effective" using the same rules for each data frame. As is, the values in each column are 1:5, and I want to recode 5's to 1, 4's to 2, 2's to 4, and 1's to 5 (3 will stay the same).
I do not want the end result to be one merged dataset; instead, I want to apply the same recoding rules across all 5 independent data frames. For simplicity's sake, let's just assume I have 2 data frames:
df1 <- data.frame(a_comfort = c(1, 2, 3, 4, 5),
                  b_comfort = c(1, 2, 3, 4, 5),
                  c_effective = c(1, 2, 3, 4, 5))

df2 <- data.frame(a_comfort = c(1, 2, 3, 4, 5),
                  b_comfort = c(1, 2, 3, 4, 5),
                  c_effective = c(1, 2, 3, 4, 5))
What I want is:
df1 <- data.frame(a_comfort = c(5, 4, 3, 2, 1),
                  b_comfort = c(5, 4, 3, 2, 1),
                  c_effective = c(5, 4, 3, 2, 1))

df2 <- data.frame(a_comfort = c(5, 4, 3, 2, 1),
                  b_comfort = c(5, 4, 3, 2, 1),
                  c_effective = c(5, 4, 3, 2, 1))
Conventionally, I would use dplyr's mutate_at and ends_with to achieve my goal, but I have not been successful with this method across multiple data frames. I am thinking a combination of the purrr and dplyr packages will work, but I haven't nailed down the exact technique.
Thanks in advance for any help!
You can use get() and assign() in a loop:
library(dplyr)
for (df_name in c("df1", "df2")) {
  df <- mutate(
    get(df_name),
    across(
      ends_with(c("_comfort", "_agree", "_effective")),
      \(x) 6 - x
    )
  )
  assign(df_name, df)
}
Result:
#> df1
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
#> df2
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
Note, however, that it's often better practice to keep multiple related data frames in a list than loose in the global environment. In this case, you can use purrr::map() (or base::lapply()):
library(dplyr)
library(purrr)
dfs <- list(df1, df2)
dfs <- map(
  dfs,
  \(df) mutate(
    df,
    across(
      ends_with(c("_comfort", "_agree", "_effective")),
      \(x) 6 - x
    )
  )
)
Result:
#> dfs
[[1]]
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
[[2]]
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
You can use ls(pattern = 'df\\d+') to find all objects whose names match a certain pattern. Then store them in a list with mget() and pass it to purrr::map or lapply for recoding.
library(dplyr)
df.lst <- purrr::map(
  mget(ls(pattern = 'df\\d+')),
  ~ .x %>% mutate(6 - across(ends_with(c("_comfort", "_agree", "effective"))))
)
# $df1
# a_comfort b_comfort c_effective
# 1 5 5 5
# 2 4 4 4
# 3 3 3 3
# 4 2 2 2
# 5 1 1 1
#
# $df2
# a_comfort b_comfort c_effective
# 1 5 5 5
# 2 4 4 4
# 3 3 3 3
# 4 2 2 2
# 5 1 1 1
You can then overwrite those data frames in your workspace from the list through list2env().
list2env(df.lst, .GlobalEnv)
Please try the code below, where I convert the columns to factors and then recode them. Note that the recoded columns end up as factors rather than numerics.
data
a_comfort b_comfort c_effective
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
code
library(tidyverse)
df1 %>%
  mutate(across(c(ends_with('comfort'), ends_with('effective')),
                ~ factor(.x, levels = c('1', '2', '3', '4', '5'),
                         labels = c('5', '4', '3', '2', '1'))))
output
a_comfort b_comfort c_effective
1 5 5 5
2 4 4 4
3 3 3 3
4 2 2 2
5 1 1 1
I have a data frame like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)),
                 t = c(1:5, 1:5),
                 value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
# Limits for desired cumulative sum (CumSum)
maxCumSum <- 8
minCumSum <- 0
What I would like to calculate is a cumulative sum of value by group (grp), kept within the bounds maxCumSum and minCumSum. The resulting table should look something like this:
grp t value CumSum
a 1 -1 0
a 2 5 5
a 3 9 8
a 4 -15 0
a 5 6 6
b 1 5 5
b 2 1 6
b 3 7 8
b 4 -11 0
b 5 9 8
Think of CumSum as a water store that has a certain maximum capacity and whose level cannot sink below zero.
The normal cumsum obviously does not do the trick, since it imposes no maximum or minimum. Does anyone have a suggestion for how to achieve this? In the real data frame there are of course more than 2 groups and far more than 5 time steps.
Many thanks!
What you can do is create a function that clamps each new running total between the minimum and maximum, and then apply it along each group with Reduce(..., accumulate = TRUE), like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)),
                 t = c(1:5, 1:5),
                 value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
library(dplyr)
maxCumSum <- 8
minCumSum <- 0
f <- function(x, y) max(min(x + y, maxCumSum), minCumSum)
df %>%
  group_by(grp) %>%
  mutate(CumSum = Reduce(f, value, 0, accumulate = TRUE)[-1])
#> # A tibble: 10 × 4
#> # Groups: grp [2]
#> grp t value CumSum
#> <chr> <int> <dbl> <dbl>
#> 1 a 1 -1 0
#> 2 a 2 5 5
#> 3 a 3 9 8
#> 4 a 4 -15 0
#> 5 a 5 6 6
#> 6 b 1 5 5
#> 7 b 2 1 6
#> 8 b 3 7 8
#> 9 b 4 -11 0
#> 10 b 5 9 8
Created on 2022-07-04 by the reprex package (v2.0.1)
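If you prefer purrr, accumulate() expresses the same clamped scan; here is an equivalent sketch, assuming the same maxCumSum and minCumSum as above:

library(dplyr)
library(purrr)

df %>%
  group_by(grp) %>%
  # accumulate() with .init returns n + 1 values, so drop the seed with [-1]
  mutate(CumSum = accumulate(value, \(acc, v) max(min(acc + v, maxCumSum), minCumSum), .init = 0)[-1])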
I have the following list
example <- list(a = c(1, 2, 3),
                b = c(2, 3),
                c = c(3, 4, 5, 6))
that I'd like to transform into the following tibble
# A tibble: 9 × 2
name value
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 2
5 b 3
6 c 3
7 c 4
8 c 5
9 c 6
I've found multiple StackOverflow questions on this subject, like here, here, or here, but none addresses this particular case where the name of the vector is not expected to become a column name.
I managed to achieve the desired result with a good old loop like below, but I'm looking for a faster and more elegant way.
library(dplyr)
example_list <- list(a = c(1, 2, 3),
                     b = c(2, 3),
                     c = c(3, 4, 5, 6))

example_tibble <- tibble()

for (i in 1:length(example_list)) {
  example_tibble <- example_tibble %>%
    bind_rows(as_tibble(example_list[[i]]) %>%
                mutate(name = names(example_list)[[i]]))
}

example_tibble <- example_tibble %>%
  relocate(name)
Try stack:
> stack(example)
values ind
1 1 a
2 2 a
3 3 a
4 2 b
5 3 b
6 3 c
7 4 c
8 5 c
9 6 c
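stack() names its columns values and ind (with ind as a factor), so if you need the exact column names and types from the question, a small follow-up cleanup with dplyr could be:

library(dplyr)

stack(example) %>%
  as_tibble() %>%
  transmute(name = as.character(ind), value = values)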
example <- list(a = c(1, 2, 3),
                b = c(2, 3),
                c = c(3, 4, 5, 6))

library(tidyverse)

enframe(example) %>%
  unnest(value)
#> # A tibble: 9 x 2
#> name value
#> <chr> <dbl>
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 2
#> 5 b 3
#> 6 c 3
#> 7 c 4
#> 8 c 5
#> 9 c 6
Created on 2021-11-04 by the reprex package (v2.0.1)
I have a data frame like this:
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 2, 6, 7, 2, 6),
               c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 5, 3, 7, 2, 6),
               e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 2, 2, 7, 5, 2)),
          .Names = c("Love_ABC", "Love_CNN", "Hate_ABC", "Hate_CNN",
                     "Love_CNBC", "Hate_CNBC"),
          row.names = c(NA, 8L), class = "data.frame")
I have made the following for loop
channels = c("ABC", "CNN", "CNBC")

for (channel in channels) {
  dataframe <- dataframe %>%
    mutate(ALL_channel = Love_channel + Hate_channel)
}
But when I run the for loop, R tells me "object 'Love_channel' not found". Have I done something wrong in the for loop?
Here's a way with rlang. Note that reshaping the data is likely more straightforward; non-standard evaluation (NSE) is a complicated topic.
for (channel in channels) {
  DF <- DF %>%
    mutate(!!sym(paste0("ALL_", channel)) :=
             !!sym(paste0("Love_", channel)) + !!sym(paste0("Hate_", channel)))
}
DF
DF
## Love_ABC Love_CNN Hate_ABC Hate_CNN Love_CNBC Hate_CNBC ALL_ABC ALL_CNN ALL_CNBC
## 1 1 1 6 6 1 2 7 7 3
## 2 3 3 3 2 2 3 6 5 5
## 3 4 4 6 4 4 4 10 8 8
## 4 6 2 5 5 5 2 11 7 7
## 5 3 6 3 3 6 2 6 9 8
## 6 2 7 6 7 7 7 8 14 14
## 7 5 2 5 2 6 5 10 4 11
## 8 1 6 3 6 3 2 4 12 5
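If the !!sym() incantation feels heavy, newer dplyr versions also support the .data pronoun together with glue-style names on the left-hand side of :=; a sketch of the same loop under that assumption:

library(dplyr)

for (channel in channels) {
  DF <- DF %>%
    mutate("ALL_{channel}" := .data[[paste0("Love_", channel)]] +
             .data[[paste0("Hate_", channel)]])
}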
This is a solution with dplyr and tidyr:
library(tidyr)
library(dplyr)
dataframe <- dataframe %>%
  tibble::rowid_to_column()

dataframe %>%
  pivot_longer(-rowid, names_to = c(NA, "channel"), names_sep = "_") %>%
  pivot_wider(names_from = channel, names_prefix = "ALL_", values_from = value, values_fn = sum) %>%
  right_join(dataframe, by = "rowid") %>%
  select(-rowid)
#> # A tibble: 8 x 9
#> ALL_ABC ALL_CNN ALL_CNBC Love_ABC Love_CNN Hate_ABC Hate_CNN Love_CNBC Hate_CNBC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 7 7 3 1 1 6 6 1 2
#> 2 6 5 5 3 3 3 2 2 3
#> 3 10 8 8 4 4 6 4 4 4
#> 4 11 7 7 6 2 5 5 5 2
#> 5 6 9 8 3 6 3 3 6 2
#> 6 8 14 14 2 7 6 7 7 7
#> 7 10 4 11 5 2 5 2 6 5
#> 8 4 12 5 1 6 3 6 3 2
The idea is to reshape the data to make the sums easier, then join the final result back to the initial dataframe:
1. Start by uniquely identifying each row with a rowid.
2. Reshape with pivot_longer so as to have all values neatly in one column. In this step you also separate names like Love_ABC into two parts and drop the Love/Hate part, since you are interested only in the channel (that is what the NA does!).
3. Reshape again: this time you want to get one column for each channel. In this step you also sum up what previously were the Love and Hate values for each rowid and channel (that's what values_fn = sum does!). You also add a prefix (names_prefix = "ALL_") to each new column name so that the names match your expected final result.
4. With right_join you add the values back to the original dataframe. You have no need for rowid now, so you can remove it.
I have a data set with repeating rows. I want to remove repeated rows and count them, but only if they're consecutive. I'm looking for an efficient way to do this, but can't think of how in dplyr or data.table.
MWE
dat <- data.frame(
  x = c(6, 2, 3, 3, 3, 1, 1, 6, 5, 5, 6, 6, 5, 4),
  y = c(7, 5, 7, 7, 7, 5, 5, 7, 1, 2, 7, 7, 1, 7),
  z = c(rep(LETTERS[1:2], each = 7))
)
## x y z
## 1 6 7 A
## 2 2 5 A
## 3 3 7 A
## 4 3 7 A
## 5 3 7 A
## 6 1 5 A
## 7 1 5 A
## 8 6 7 B
## 9 5 1 B
## 10 5 2 B
## 11 6 7 B
## 12 6 7 B
## 13 5 1 B
## 14 4 7 B
Desired output
x y z n
1 6 7 A 1
2 2 5 A 1
3 3 7 A 3
4 1 5 A 2
5 6 7 B 1
6 5 1 B 1
7 5 2 B 1
8 6 7 B 2
9 5 1 B 1
10 4 7 B 1
With data.table:
library(data.table)
setDT(dat)
dat[, c(.SD[1L], .N), by=.(g = rleidv(dat))][, g := NULL]
x y z N
1: 6 7 A 1
2: 2 5 A 1
3: 3 7 A 3
4: 1 5 A 2
5: 6 7 B 1
6: 5 1 B 1
7: 5 2 B 1
8: 6 7 B 2
9: 5 1 B 1
10: 4 7 B 1
Similar to Ricky's answer, here's another base solution:
with(rle(do.call(paste, dat)), cbind(dat[ cumsum(lengths), ], lengths))
In case paste doesn't cut it for the column classes you have, you can do:
ud = unique(dat)
ud$r = seq_len(nrow(ud))
dat$r0 = seq_len(nrow(dat))
newdat = merge(dat, ud)
with(rle(newdat[order(newdat$r0), ]$r), cbind(dat[cumsum(lengths), ], lengths))
... though I'm guessing there's some better way.
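One possibility, for arbitrary column classes, is to build the composite key with interaction() instead of paste (a sketch; interaction() unwraps a single list argument, so it accepts a data frame directly):

r <- rle(as.integer(interaction(dat, drop = TRUE)))
cbind(dat[cumsum(r$lengths), ], n = r$lengths)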
With dplyr, you can borrow data.table::rleid to make a run ID column, then use n() to count rows and distinct() to chop out the repeats:
dat %>%
  group_by(run = data.table::rleid(x, y, z)) %>%
  mutate(n = n()) %>%
  distinct() %>%
  ungroup() %>%
  select(-run)
You can replace rleid with just base R, if you like, but it's not as pretty:
dat %>%
  group_by(run = rep(seq_along(rle(paste(x, y, z))$len),
                     times = rle(paste(x, y, z))$len)) %>%
  mutate(n = n()) %>%
  distinct() %>%
  ungroup() %>%
  select(-run)
Either way, you get:
Source: local data frame [10 x 4]
x y z n
(dbl) (dbl) (fctr) (int)
1 6 7 A 1
2 2 5 A 1
3 3 7 A 3
4 1 5 A 2
5 6 7 B 1
6 5 1 B 1
7 5 2 B 1
8 6 7 B 2
9 5 1 B 1
10 4 7 B 1
Edit
Per @Frank's comment, you can also use summarise to insert n and collapse the groups, instead of mutate and distinct, if you group_by all the variables you want to keep before run, as summarise collapses the last group. One advantage of this approach is that you don't have to ungroup to get rid of run; summarise does that for you:
dat %>%
  group_by(x, y, z, run = data.table::rleid(x, y, z)) %>%
  summarise(n = n()) %>%
  select(-run)
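As an aside, on dplyr 1.1.0 or later, consecutive_id() gives you a native run ID, so the data.table dependency can be dropped entirely (a minimal sketch):

library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0

dat %>%
  count(run = consecutive_id(x, y, z), x, y, z) %>%
  select(-run)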
A base solution below:
idx <- rle(with(dat, paste(x, y, z)))
d <- cbind(do.call(rbind, strsplit(idx$values, " ")), idx$lengths)
as.data.frame(d)
V1 V2 V3 V4
1 6 7 A 1
2 2 5 A 1
3 3 7 A 3
4 1 5 A 2
5 6 7 B 1
6 5 1 B 1
7 5 2 B 1
8 6 7 B 2
9 5 1 B 1
10 4 7 B 1
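Note that rbinding the split strings produces a character matrix, so every column of d comes out as character and the original column names are lost. A possible cleanup (a sketch; type.convert() has a data.frame method in recent R versions):

d <- as.data.frame(d)
names(d) <- c(names(dat), "n")
d <- type.convert(d, as.is = TRUE)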
If you have a large dataset, you could use a similar idea to Frank's data.table solution, but avoid using .SD like this:
dat[, g := rleidv(dat)][, N := .N, keyby = g
][J(unique(g)), mult = "first"
][, g := NULL
][]
It's less readable, and it turns out it's slower, too, so Frank's solution wins on both counts.
# benchmark on 14 million rows
dat <- data.frame(
  x = rep(c(6, 2, 3, 3, 3, 1, 1, 6, 5, 5, 6, 6, 5, 4), 1e6),
  y = rep(c(7, 5, 7, 7, 7, 5, 5, 7, 1, 2, 7, 7, 1, 7), 1e6),
  z = rep(c(rep(LETTERS[1:2], each = 7)), 1e6)
)
setDT(dat)
d1 <- copy(dat)
d2 <- copy(dat)
With R 3.2.4 and data.table 1.9.7 (on Frank's computer):
system.time(d1[, c(.SD[1L], .N), by=.(g = rleidv(d1))][, g := NULL])
# user system elapsed
# 0.42 0.10 0.52
system.time(d2[, g := rleidv(d2)][, N := .N, keyby = g][J(unique(g)), mult = "first"][, g := NULL][])
# user system elapsed
# 2.48 0.25 2.74
Not much different from the other answers, but (1) having ordered data and (2) looking for consecutive runs make this a good candidate for simply ORing x[-1L] != x[-length(x)] across columns, instead of pasting or other complex operations. I guess this is, roughly, equivalent to data.table::rleid.
ans = logical(nrow(dat) - 1L)
for (j in seq_along(dat)) ans[dat[[j]][-1L] != dat[[j]][-nrow(dat)]] = TRUE
ans = c(TRUE, ans)
# or, in two passes: c(TRUE, Reduce("|", lapply(dat, function(x) x[-1L] != x[-length(x)])))
cbind(dat[ans, ], n = tabulate(cumsum(ans)))
# x y z n
#1 6 7 A 1
#2 2 5 A 1
#3 3 7 A 3
#6 1 5 A 2
#8 6 7 B 1
#9 5 1 B 1
#10 5 2 B 1
#11 6 7 B 2
#13 5 1 B 1
#14 4 7 B 1
Another base attempt using ave, just because:
dat$grp <- ave(
  seq_len(nrow(dat)),
  dat[c("x", "y", "z")],
  FUN = function(x) cumsum(c(1, diff(x)) != 1)
)
dat$count <- ave(dat$grp, dat, FUN=length)
dat[!duplicated(dat[1:4]),]
# x y z grp count
#1 6 7 A 0 1
#2 2 5 A 0 1
#3 3 7 A 0 3
#6 1 5 A 0 2
#8 6 7 B 0 1
#9 5 1 B 0 1
#10 5 2 B 0 1
#11 6 7 B 1 2
#13 5 1 B 1 1
#14 4 7 B 0 1
And a data.table conversion attempt:
d1[, .(sq=.I, grp=cumsum(c(1, diff(.I)) != 1)), by=list(x,y,z)][(sq), .N, by=list(x,y,z,grp)]