I have some data as follows:
library(tidyr)
library(data.table)
thisdata <- data.frame(numbers = c(1,3,4,5,6,1,2,4,5,6)
,letters = c('A','A','A','A','A','B','B','B','B','B'))
otherdata <- data.frame(numbers = c(1,2,3,4,5,6))
I am looking to split 'thisdata' by the letters column, merge each resulting piece with 'otherdata' by the numbers column, then fill the NA values in letters with the letter that belongs to that piece. So:
out <- split(thisdata , f = thisdata$letters )
out2 <- lapply(out, function(x) merge(x,otherdata,by="numbers",all = TRUE))
However, I can't get the 'fill' function in tidyr to work within the lapply:
out3 <- lapply(out2,function(x) fill(x$channel))
Error in UseMethod("fill_") :
no applicable method for 'fill_' applied to an object of class "NULL"
This is the output I'm after, but would rather perform the calculation within the list format:
out4 <- rbindlist(out2)
out5 <- out4 %>%
fill(letters) %>% #default direction down
fill(letters,.direction = "up")
numbers letters
1: 1 A
2: 2 A
3: 3 A
4: 4 A
5: 5 A
6: 6 A
7: 1 B
8: 2 B
9: 3 B
10: 4 B
11: 5 B
12: 6 B
fill expects a data frame as its first argument, so pass the whole data frame and name the column to fill: try fill(x, letters), or x %>% fill(letters) with the magrittr pipe:
out3 <- lapply(out2,function(x) fill(x, letters))
out3
#$A
# numbers letters
#1 1 A
#2 2 A
#3 3 A
#4 4 A
#5 5 A
#6 6 A
#$B
# numbers letters
#1 1 B
#2 2 B
#3 3 B
#4 4 B
#5 5 B
#6 6 B
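One caveat with the default direction: if the first row of a piece is the missing one, filling down leaves it as NA. A minimal sketch of the same lapply with .direction = "downup", assuming a tidyr version that supports that value (1.0.0 or later). The data are a hypothetical variant of the question's, with number 1 removed from group A so that the leading row is the NA:

```r
library(tidyr)

# Hypothetical variant of the question's data: group A has no row for numbers == 1
thisdata <- data.frame(
  numbers = c(2, 3, 4, 5, 6, 1, 2, 4, 5, 6),
  letters = rep(c("A", "B"), each = 5)
)
otherdata <- data.frame(numbers = c(1, 2, 3, 4, 5, 6))

# Split by letter, then merge each piece so every number appears once
out2 <- lapply(split(thisdata, thisdata$letters),
               function(x) merge(x, otherdata, by = "numbers", all = TRUE))

# Fill down first, then up, so a leading NA is filled as well
out3 <- lapply(out2, function(x) fill(x, letters, .direction = "downup"))
out3
```

With the plain fill(x, letters) the first row of out3$A would keep its NA, because there is no earlier value to carry down.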
A simpler method is to use tidyr::complete (arrange comes from dplyr, so load it as well):
library(dplyr)
thisdata %>%
complete(numbers = otherdata$numbers, letters) %>%
arrange(letters)
# A tibble: 12 x 2
# numbers letters
# <dbl> <fctr>
# 1 1 A
# 2 2 A
# 3 3 A
# 4 4 A
# 5 5 A
# 6 6 A
# 7 1 B
# 8 2 B
# 9 3 B
#10 4 B
#11 5 B
#12 6 B
Given two dataframes with the same column names:
a <- data.frame(x=1:4,y=5:8)
b <- data.frame(x=LETTERS[1:4],y=LETTERS[5:8])
>a
x y
1 5
2 6
3 7
4 8
>b
x y
A E
B F
C G
D H
How can each column with the same name be concatenated?
Desired output:
cat_x cat_y
1 A 5 E
2 B 6 F
3 C 7 G
4 D 8 H
Tried so far, merging columns one at a time:
a$cat_x <- paste(a$x,b$x)
a$cat_y <- paste(a$y,b$y)
This approach works, but the real data has 40 columns (and will include several more dataframes). Looking for a method that scales to larger dataframes without writing one line per column.
We may use Map to loop over the corresponding columns of the two data frames
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b,
MoreArgs = list(sep = "_")))
-output
cat_x cat_y
1 1_A 5_E
2 2_B 6_F
3 3_C 7_G
4 4_D 8_H
sep is used above in case we want a custom delimiter. Without it, paste separates the values with a space by default
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b ))
cat_x cat_y
1 1 A 5 E
2 2 B 6 F
3 3 C 7 G
4 4 D 8 H
Another possible solution, using purrr::map2_dfc:
library(tidyverse)
map2_dfc(a,b, ~ str_c(.x, .y, sep = " ")) %>%
rename_with(~ str_c("cat", .x, sep = "_"))
#> # A tibble: 4 × 2
#> cat_x cat_y
#> <chr> <chr>
#> 1 1 A 5 E
#> 2 2 B 6 F
#> 3 3 C 7 G
#> 4 4 D 8 H
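Since the question mentions that more data frames will be added, the Map idea can be extended to a whole list of them. A minimal base R sketch, assuming every data frame has the same columns in the same order; the third data frame d and the name dfs are made up for illustration:

```r
a <- data.frame(x = 1:4, y = 5:8)
b <- data.frame(x = LETTERS[1:4], y = LETTERS[5:8])
d <- data.frame(x = letters[1:4], y = letters[5:8])  # hypothetical third data frame

dfs <- list(a, b, d)

# Equivalent to Map(paste, a, b, d): paste the first columns together,
# then the second columns, and so on
pasted <- do.call(Map, c(list(f = paste), dfs))

res <- as.data.frame(setNames(pasted, paste0("cat_", names(a))),
                     stringsAsFactors = FALSE)
res
```

Adding another data frame then only means adding it to dfs; the columns are matched by position, not by name.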
In the example below how can I calculate the row mean when column A is NA? The row mean would replace the NA in column A. Using base R, I can use this:
foo <- tibble(A = c(3,5,NA,6,NA,7,NA),
B = c(4,5,4,5,6,4,NA),
C = c(6,5,2,8,8,5,NA))
foo
tmp <- rowMeans(foo[,-1],na.rm = TRUE)
foo$A[is.na(foo$A)] <- tmp[is.na(foo$A)]
foo$A[is.nan(foo$A)] <- NA
Curious how I might do this with dplyr?
You can use ifelse :
library(dplyr)
foo %>%
mutate(A = ifelse(is.na(A), rowMeans(., na.rm = TRUE), A),
A = replace(A, is.nan(A), NA))
# A B C
# <dbl> <dbl> <dbl>
#1 3 4 6
#2 5 5 5
#3 3 4 2
#4 6 5 8
#5 7 6 8
#6 7 4 5
#7 NA NA NA
Here is a solution that replaces NA not only in column A, but in every column of the data frame.
library(dplyr)
foo2 <- foo %>%
mutate(RowMean = rowMeans(., na.rm = TRUE)) %>%
mutate(across(-RowMean, .fns =
function(x) ifelse(is.na(x) & !is.nan(RowMean), RowMean, x))) %>%
select(-RowMean)
Use if_else with a temporary row-mean column:
foo %>%
mutate(m = rowMeans(across(everything()), na.rm = TRUE),
A = if_else(is.na(A) & !is.na(m), m, A)) %>%
select(-m)
# # A tibble: 7 x 3
# A B C
# <dbl> <dbl> <dbl>
# 1 3 4 6
# 2 5 5 5
# 3 3 4 2
# 4 6 5 8
# 5 7 6 8
# 6 7 4 5
# 7 NA NA NA
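The answers above average over all columns, which gives the same result here because A is NA in exactly the rows being filled. A minimal sketch that mirrors the base R version more literally by averaging only B and C; the helper column m is a temporary name chosen for illustration:

```r
library(dplyr)

foo <- tibble(A = c(3, 5, NA, 6, NA, 7, NA),
              B = c(4, 5, 4, 5, 6, 4, NA),
              C = c(6, 5, 2, 8, 8, 5, NA))

res <- foo %>%
  # row mean over B and C only; a row where both are NA yields NaN
  mutate(m = rowMeans(cbind(B, C), na.rm = TRUE),
         # fill A only where it is missing and a usable mean exists
         A = if_else(is.na(A) & !is.nan(m), m, A)) %>%
  select(-m)
res
```

Checking !is.nan(m) keeps the last row as NA rather than NaN, which is what the third line of the base R code does.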
Consider the following two tibbles:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
So a and b have the same columns and b has an additional column called id.
I want to do the following: group b by id and then add tibble a on top of each group.
So the output should look like this:
# A tibble: 10 x 3
id time value
<chr> <int> <int>
1 a -1 100
2 a 0 200
3 a 1 1
4 a 2 2
5 a 3 3
6 b -1 100
7 b 0 200
8 b 1 4
9 b 2 5
10 b 3 6
Of course there are multiple workarounds to achieve this (like loops for example). But in my case I have a large number of IDs and a very large number of columns.
I would be thankful if anyone could point me towards the direction of a solution within the tidyverse.
Thank you
We can expand the data frame a with id from b and then bind_rows them together.
library(tidyverse)
a2 <- expand(a, id = b$id, nesting(time, value))
b2 <- bind_rows(a2, b) %>% arrange(id, time)
b2
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
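split from base R will divide a data frame into a list of subsets based on an index. Note that bind_rows here appends a below each group rather than on top, as the output shows; add arrange(id, time) at the end if the order matters.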
b %>%
split(b[["id"]]) %>%
lapply(bind_rows, a) %>%
lapply(select, -"id") %>%
bind_rows(.id = "id")
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a -1 100
# 5 a 0 200
# 6 b 1 4
# 7 b 2 5
# 8 b 3 6
# 9 b -1 100
# 10 b 0 200
An idea (via base R) is to split your data frame and create a new one with id + the other data frame and rbind, i.e.
df = do.call(rbind, lapply(split(b, b$id), function(i)rbind(data.frame(id = i$id[1], a), i)))
which gives
id time value
a.1 a -1 100
a.2 a 0 200
a.3 a 1 1
a.4 a 2 2
a.5 a 3 3
b.1 b -1 100
b.2 b 0 200
b.3 b 1 4
b.4 b 2 5
b.5 b 3 6
NOTE: You can remove the rownames by simply calling rownames(df) <- NULL
We can nest and add the relevant rows to each nested item :
library(tidyverse)
b %>%
nest(data = -id) %>%
mutate(data= map(data,~bind_rows(a,.x))) %>%
unnest(data)
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
Maybe not the most efficient way, but easy to follow:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
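b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
# add_column recycles a single value to every row of a
# (length(a) counts columns, not rows, so it only worked here by coincidence)
a.a <- a %>% add_column(id = "a")
a.b <- a %>% add_column(id = "b")
joint <- bind_rows(b,a.a,a.b)
# sort by time as well so the rows of a come first within each id
(joint <- arrange(joint,id,time))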
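With many ids, writing one add_column call per id does not scale. A minimal sketch of the same idea using dplyr::group_modify, which applies a function to each group and puts the grouping column back afterwards; this assumes dplyr 0.8.1 or later:

```r
library(dplyr)

a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)

res <- b %>%
  group_by(id) %>%
  # .x holds the rows of one group without the id column;
  # a goes first so it lands on top of each group
  group_modify(~ bind_rows(a, .x)) %>%
  ungroup()
res
```

Because a is bound first, no sorting step is needed, and the integer columns of b are promoted to double to match a.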
I like the bind_rows function in dplyr but I find it annoying that when passing the .id argument it can only add a numeric index in the new column.
I'm trying to write a bind_rows_named function but am getting stuck accessing the object names. This works as expected:
bind_name_to_df <- function(df){
dfname <- deparse(substitute(df))
df %>% mutate(label=dfname)
}
a <- data_frame(stuff=1:10)
bind_name_to_df(a)
But I can't work out how to apply this to a list of data frames, e.g. using .dots. I want this to work, but I know I have the semantics for the ... wrong somehow. Can anyone shed light?
b <- data_frame(stuff=1:10)
bind_rows_named <- function(...){
return(
bind_rows(lapply(..., bind_name_to_df)))
}
bind_rows_named(a, b)
Here is an option using base R
bind_named <- function(...){
v1 <- sapply(match.call()[-1], deparse)
dfs <- list(...)
Map(cbind, dfs, label = v1)
}
bind_named(a, b)
#[[1]]
# stuff label
#1 1 a
#2 2 a
#3 3 a
#4 4 a
#5 5 a
#6 6 a
#7 7 a
#8 8 a
#9 9 a
#10 10 a
#[[2]]
# stuff label
#1 1 b
#2 2 b
#3 3 b
#4 4 b
#5 5 b
#6 6 b
#7 7 b
#8 8 b
#9 9 b
#10 10 b
Or using tidyverse
library(tidyverse)
bind_named <- function(...) {
nm1 <- quos(...) %>%
map(quo_name)
dfs <- list(...)
dfs %>%
map2(nm1, ~mutate(., label = .y))
}
res <- bind_named(a, b)
res %>%
map(head, 2)
#[[1]]
# stuff label
#1 1 a
#2 2 a
#[[2]]
# stuff label
#1 1 b
#2 2 b
It can also be made into a single chain
bind_named <- function(...) {
quos(...) %>%
map(quo_name) %>%
map2_df(list(...), ., ~mutate(.data = .x, label = .y))
}
bind_named(a, b)
# A tibble: 20 x 2
# stuff label
# <int> <chr>
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 a
# 5 5 a
# 6 6 a
# 7 7 a
# 8 8 a
# 9 9 a
#10 10 a
#11 1 b
#12 2 b
#13 3 b
#14 4 b
#15 5 b
#16 6 b
#17 7 b
#18 8 b
#19 9 b
#20 10 b
NOTE: Initially, we thought the OP wanted to create columns on separate datasets and get a list output. Upon clarification, map2 was changed to map2_df, which returns a single dataset
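It may also be worth knowing that bind_rows only falls back to a numeric index when its inputs are unnamed: with a named list, .id takes the labels from the list names. A minimal sketch with small made-up tibbles, naming the list by hand (tibble::lst(a, b) should produce the same names automatically):

```r
library(dplyr)

a <- tibble(stuff = 1:3)
b <- tibble(stuff = 4:6)

# The list names become the values of the new "label" column
res <- bind_rows(list(a = a, b = b), .id = "label")
res
```

This avoids capturing argument names altogether, at the cost of building the named list at the call site.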
There are many answers for how to split a dataframe, for example How to split a data frame?
However, I'd like to split a dataframe so that the smaller dataframes contain the last row of the previous dataframe and the first row of the following dataframe.
Here's an example
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
data.frame(n = n, group)
n group
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
8 8 c
9 9 c
I'd like the output to look like:
d1 <- data.frame(n = 1:4, group = c(rep("a",3),"b"))
d2 <- data.frame(n = 3:7, group = c("a",rep("b",3),"c"))
d3 <- data.frame(n = 6:9, group = c("b",rep("c",3)))
d <- list(d1, d2, d3)
d
[[1]]
n group
1 1 a
2 2 a
3 3 a
4 4 b
[[2]]
n group
1 3 a
2 4 b
3 5 b
4 6 b
5 7 c
[[3]]
n group
1 6 b
2 7 c
3 8 c
4 9 c
What is an efficient way to accomplish this task?
Suppose DF is the original data.frame, the one with columns n and group. Let n be the number of rows in DF. Now define a function extract which given a sequence of indexes ix enlarges it to include the one prior to the first and after the last and then returns those rows of DF. Now that we have defined extract, split the vector 1, ..., n by group and apply extract to each component of the split.
n <- nrow(DF)
extract <- function(ix) DF[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
lapply(split(seq_len(n), DF$group), extract)
$a
n group
1 1 a
2 2 a
3 3 a
4 4 b
$b
n group
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
$c
n group
6 6 b
7 7 c
8 8 c
9 9 c
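The same steps can be wrapped in a function so that they do not depend on the global DF and n. A minimal sketch, assuming the rows of each group are contiguous; the names split_with_neighbours and widen are made up for illustration:

```r
# Split df by g, extending each piece by one row on either side
split_with_neighbours <- function(df, g) {
  n_rows <- nrow(df)
  # widen an index range by one row each way, clamped to the data frame
  widen <- function(ix) {
    df[seq(max(1, min(ix) - 1), min(n_rows, max(ix) + 1)), , drop = FALSE]
  }
  lapply(split(seq_len(n_rows), g), widen)
}

DF <- data.frame(n = 1:9, group = rep(c("a", "b", "c"), each = 3))
res <- split_with_neighbours(DF, DF$group)
res
```

If a group's rows were scattered, the range from min(ix) - 1 to max(ix) + 1 would also pull in the rows of other groups lying in between, so sort by the grouping column first in that case.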
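Or why not try good ol' by, which "[a]ppl[ies] a Function to a Data Frame Split by Factors [INDICES]". Here df is the data frame from the question, and the code relies on n being equal to the row number.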
by(data = df, INDICES = df$group, function(x){
id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
na.omit(df[id, ])
})
# df$group: a
# n group
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# --------------------------------------------------------------------------------
# df$group: b
# n group
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# --------------------------------------------------------------------------------
# df$group: c
# n group
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
Although the print method of by creates a 'fancy' output, the (default) result is a list, with elements named by the levels of the grouping variable (just try str and names on the resulting object).
I was going to comment under @cdeterman's answer, but it's too late now.
You can generalize his approach using data.table::shift (or dplyr::lag) in order to find the indices where the group changes, and then run a simple lapply on the ranges, something like
library(data.table) # v1.9.6+
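# fill with the first value so that row 1 is never flagged as a group change
# (assumes group is a character column, the data.frame default since R 4.0)
indx <- setDT(df)[, which(group != shift(group, fill = group[1L]))]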
lapply(Map(`:`, c(1L, indx - 1L), c(indx, nrow(df))), function(x) df[x,])
# [[1]]
# n group
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 4 b
#
# [[2]]
# n group
# 1: 3 a
# 2: 4 b
# 3: 5 b
# 4: 6 b
# 5: 7 c
#
# [[3]]
# n group
# 1: 6 b
# 2: 7 c
# 3: 8 c
# 4: 9 c
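Could be done with data.frame as well, but is there ever a reason not to use data.table? Also this has the option to be executed in parallel: as written, %do% runs the loop sequentially, and swapping it for %dopar% after registering a backend would distribute the iterations.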
library(data.table)
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
df <- data.table(n = n, group)
df[, `:=` (group = factor(df$group))]
df[, `:=` (group_i = seq_len(.N), group_N = .N), by = "group"]
library(doParallel)
groups <- unique(df$group)
foreach(i = seq(groups)) %do% {
df[group == groups[i] | (as.integer(group) == i + 1 & group_i == 1) | (as.integer(group) == i - 1 & group_i == group_N), c("n", "group"), with = FALSE]
}
[[1]]
n group
1: 1 a
2: 2 a
3: 3 a
4: 4 b
[[2]]
n group
1: 3 a
2: 4 b
3: 5 b
4: 6 b
5: 7 c
[[3]]
n group
1: 6 b
2: 7 c
3: 8 c
4: 9 c
Here is another dplyr way:
library(dplyr)
data =
data_frame(n = n, group) %>%
group_by(group)
firsts =
data %>%
slice(1) %>%
ungroup %>%
mutate(new_group = lag(group)) %>%
slice(-1)
lasts =
data %>%
slice(n()) %>%
ungroup %>%
mutate(new_group = lead(group)) %>%
slice(-n())
bind_rows(firsts, data, lasts) %>%
mutate(final_group =
ifelse(is.na(new_group),
group,
new_group) ) %>%
arrange(final_group, n) %>%
group_by(final_group)