This is an example of my data frame with a list in it.
df <- data.frame(x = 1:2, y = c("A", "B"))
df$z <- rep(list(1:3), 2)
df
> df
x y z
1 1 A 1, 2, 3
2 2 B 1, 2, 3
I would like to unlist the list column z and rearrange the data frame so that each list element gets its own row, as follows:
x y z
1 1 A 1
2 1 A 2
3 1 A 3
4 2 B 1
5 2 B 2
6 2 B 3
I tried unlist(df) but could not get it right.
tidyr::unnest(df,z) # or even unnest_longer(df, z)
# A tibble: 6 × 3
x y z
<int> <chr> <int>
1 1 A 1
2 1 A 2
3 1 A 3
4 2 B 1
5 2 B 2
6 2 B 3
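If you would rather not depend on tidyr, the same reshaping can be sketched in base R: repeat each row index once per element of its list entry, then flatten the list column. The names idx and long are only illustrative.

```r
df <- data.frame(x = 1:2, y = c("A", "B"))
df$z <- rep(list(1:3), 2)

# One row index per element of each list entry: 1 1 1 2 2 2
idx <- rep(seq_len(nrow(df)), lengths(df$z))

# Repeat the atomic columns and flatten the list column
long <- data.frame(x = df$x[idx], y = df$y[idx], z = unlist(df$z))
long
```

Because lengths() is computed per row, this should also cope with list entries of different lengths.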
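Maybe try something like this:
library(dplyr)
library(purrr) # reduce() comes from purrr, not dplyr
df %>%
apply(1, function(x) expand.grid(x[1][[1]], x[2][[1]], x[3][[1]])) %>% # one small data frame per row
reduce(rbind) # stack them; columns come out as Var1, Var2, Var3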
When recoding the values of a numeric variable like var1 below into character values, sometimes there is an easy pattern. For example, suppose the numeric values 1:4 in var1 need to be recoded as LETTERS[27-(4:1)], respectively.
In such situations, is it possible to avoid writing var1 = recode(var1,`1`="W",`2`="X",`3`="Y",`4`="Z") and instead loop the recoding?
library(tidyverse)
(dat <- data.frame(var1 = rep(1:4,2), id = 1:8))
mutate(dat, var1 = recode(var1,`1`="W",`2`="X",`3`="Y",`4`="Z")) # This works but can we
# loop it as well?
We can use a vectorized approach, no loops necessary. tail and base subsetting with [ will do the trick here.
library(dplyr)
dat %>% mutate(var1=tail(LETTERS, max(var1))[var1] %>% as.factor)
var1 id
1 W 1
2 X 2
3 Y 3
4 Z 4
5 W 5
6 X 6
7 Y 7
8 Z 8
data
dat <- data.frame(var1 = rep(1:4,2), id = 1:8)
data2
dat2 <- data.frame(var1 = c(2,1,3,1,4:1), id = 1:8)
var1 id
1 2 1
2 1 2
3 3 3
4 1 4
5 4 5
6 3 6
7 2 7
8 1 8
output2
var1 id
1 X 1
2 W 2
3 Y 3
4 W 4
5 Z 5
6 Y 6
7 X 7
8 W 8
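To see why this works even when the codes are not in order, here is the lookup step on its own in base R. It assumes the codes are whole numbers running from 1 up to max(var1); the names lookup and recoded are only for illustration.

```r
var1 <- c(2, 1, 3, 1, 4:1)

# The last max(var1) letters form the lookup table: "W" "X" "Y" "Z"
lookup <- tail(LETTERS, max(var1))

# Indexing by var1 picks the matching letter for every row at once
recoded <- lookup[var1]
recoded
```

The result lines up with output2 above, since each code is simply used as a position in lookup.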
You can use:
library(dplyr)
dat %>% mutate(var1 = LETTERS[length(LETTERS)-max(var1) + var1])
# var1 id
#1 W 1
#2 X 2
#3 Y 3
#4 Z 4
#5 W 5
#6 X 6
#7 Y 7
#8 Z 8
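The arithmetic behind that index is just a constant shift. A small worked sketch (offset is an illustrative name, not part of the answer):

```r
var1 <- 1:4

# 26 letters minus the largest code: 26 - 4 = 22
offset <- length(LETTERS) - max(var1)

# Codes 1..4 become positions 23..26, i.e. the last four letters
positions <- offset + var1
LETTERS[positions]
```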
You can also just use the labels argument of factor():
library(dplyr)
dat <- data.frame(var1 = rep(1:4,2), id = 1:8) %>%
mutate(var1 = factor(var1, labels = tail(LETTERS, 4)))
dat
var1 id
1 W 1
2 X 2
3 Y 3
4 Z 4
5 W 5
6 X 6
7 Y 7
8 Z 8
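One caveat: labels are matched to the levels that actually occur in the data, so if a code happens to be missing (say there is no 3), four labels no longer match three levels and factor() fails. A sketch that pins the levels explicitly, assuming the codes always run from 1 to 4:

```r
codes <- c(1, 2, 4)  # note: no 3 in the data

# Fixing levels = 1:4 keeps the code-to-letter mapping stable
f <- factor(codes, levels = 1:4, labels = tail(LETTERS, 4))
f
```

The unused level "Y" stays in levels(f); droplevels(f) would remove it if that is not wanted.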
I've been struggling to add a new column only if it does not already exist. I found an answer here: "Adding column if it does not exist".
However, in my case I need to do this inside a purrr::map() call. I tried to adapt that answer, but it doesn't fit my needs.
Here is an example of what I'm dealing with:
Suppose I have a list of two data.frames:
library(tibble)
A = tibble(
x = 1:5, y = 1, z = 2
)
B = tibble(
x = 5:1, y = 3, z = 3, w = 7
)
dt_list = list(A, B)
The column I'd like to add is w:
cols = c(w = NA_real_)
Taken one tibble at a time, I can add a column only if it does not exist.
For B, w already exists, so no column is added:
B %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 5 3 3 7
2 4 3 3 7
3 3 3 3 7
4 2 3 3 7
5 1 3 3 7
In this case, since it does not exist, w is added:
A %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 1 1 2 NA
2 2 1 2 NA
3 3 1 2 NA
4 4 1 2 NA
5 5 1 2 NA
I tried the following to replicate it using purrr (I'd prefer not to use a for loop):
dt_list_2 = dt_list %>%
purrr::map(
~dplyr::select(., -starts_with("x")) %>%
~tibble::add_column(!!!cols[!names(cols) %in% names(.)])
)
But the output is not the same as doing it separately.
Note: This is an example of my real problem. In fact, I'm using purrr to read many *.csv files and then apply some data transformation. Something like this:
re_file <- list.files(path = dir_path, pattern = "*.csv")
cols_add = c(UCI = NA_real_)
file_list = re_file %>%
purrr::map(function(file_name){ # iterate through each file name
read_csv(file = paste0(dir_path, "//",file_name), skip = 2)
}) %>%
purrr::map(
~dplyr::select(., -starts_with("Textbox")) %>%
~dplyr::tibble(!!!cols[!names(cols) %in% names(.)])
)
You can use:
dt_list %>%
purrr::map(
~tibble::add_column(., !!!cols[!names(cols) %in% names(.)])
)
#[[1]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 1 1 2 NA
#2 2 1 2 NA
#3 3 1 2 NA
#4 4 1 2 NA
#5 5 1 2 NA
#[[2]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 5 3 3 7
#2 4 3 3 7
#3 3 3 3 7
#4 2 3 3 7
#5 1 3 3 7
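The attempt in the question also drops columns first, and the likely culprit there is the second ~ inside the pipeline: map() takes one function, so both steps need to live inside it. As a rough sketch of doing both steps per data frame, here is a small base R helper. prep_one and drop_prefix are hypothetical names, and plain data frames are used to keep it dependency-free.

```r
# Hypothetical helper: drop columns by prefix, then add any missing columns
prep_one <- function(df, cols, drop_prefix) {
  df <- df[!startsWith(names(df), drop_prefix)]
  for (nm in setdiff(names(cols), names(df))) {
    df[[nm]] <- cols[[nm]]  # the scalar default is recycled to every row
  }
  df
}

A <- data.frame(x = 1:5, y = 1, z = 2)
B <- data.frame(x = 5:1, y = 3, z = 3, w = 7)
cols <- c(w = NA_real_)

res <- lapply(list(A, B), prep_one, cols = cols, drop_prefix = "x")
res
```

With purrr, the same helper could be passed as purrr::map(dt_list, prep_one, cols = cols, drop_prefix = "x"), since map() forwards extra arguments to the function.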
I have 2 data frames in R (df1, df2).
df1 as
A C D
1 1 1
2 2 2
df2 as
A B C
1 1 1
2 2 2
How can I merge these 2 dataframes to produce the following output?
A B C D
2 1 2 1
4 2 4 2
The columns are sorted by name, and the values of columns present in both data frames are added. Both data frames have the same number of rows. Thank you in advance.
Code to create DF:
df1 <- data.frame("A" = 1:2, "C" = 1:2, "D" = 1:2)
df2 <- data.frame("A" = 1:2, "B" = 1:2, "C" = 1:2)
nm1 = names(df1)
nm2 = names(df2)
nm = intersect(nm1, nm2)
if (length(nm) == 0){ # if no column names in common
cbind(df1, df2)
} else { # if column names in common
cbind(df1[!nm1 %in% nm2], # columns only in df1
df1[nm] + df2[nm], # add columns common to both
df2[!nm2 %in% nm1]) # columns only in df2
}
# D A C B
#1 1 2 2 1
#2 2 4 4 2
You can try:
library(tidyverse)
list(df2, df1) %>%
map(rownames_to_column) %>%
bind_rows %>%
group_by(rowname) %>%
summarise_all(sum, na.rm = TRUE)
# A tibble: 2 x 5
rowname A B C D
<chr> <int> <int> <int> <int>
1 1 2 1 2 1
2 2 4 2 4 2
By using left_join() from dplyr you won't lose the column
library(tidyverse)
dat1 <- tibble(a = 1:10,
b = 1:10,
c = 1:10)
dat2 <- tibble(c = 1:10,
d = 1:10,
e = 1:10)
left_join(dat1, dat2, by = "c")
#> # A tibble: 10 x 5
#> a b c d e
#> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 1
#> 2 2 2 2 2 2
#> 3 3 3 3 3 3
#> 4 4 4 4 4 4
#> 5 5 5 5 5 5
#> 6 6 6 6 6 6
#> 7 7 7 7 7 7
#> 8 8 8 8 8 8
#> 9 9 9 9 9 9
#> 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
allnames <- sort(unique(c(names(df1), names(df2))))
df3 <- data.frame(matrix(0, nrow = nrow(df1), ncol = length(allnames)))
names(df3) <- allnames
df3[,allnames %in% names(df1)] <- df3[,allnames %in% names(df1)] + df1
df3[,allnames %in% names(df2)] <- df3[,allnames %in% names(df2)] + df2
df3
A B C D
1 2 1 2 1
2 4 2 4 2
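The same idea can be wrapped in a function that takes any number of data frames. This is only a sketch: sum_by_name is a made-up name, and it assumes all inputs are numeric and have the same number of rows.

```r
# Sum columns with the same name across data frames; sort columns by name
sum_by_name <- function(...) {
  dfs <- list(...)
  all_names <- sort(unique(unlist(lapply(dfs, names))))
  # Start from an all-zero frame with one column per distinct name
  out <- as.data.frame(matrix(0, nrow = nrow(dfs[[1]]),
                              ncol = length(all_names),
                              dimnames = list(NULL, all_names)))
  for (d in dfs) {
    out[names(d)] <- out[names(d)] + d  # add each frame into its own columns
  }
  out
}

df1 <- data.frame("A" = 1:2, "C" = 1:2, "D" = 1:2)
df2 <- data.frame("A" = 1:2, "B" = 1:2, "C" = 1:2)
res <- sum_by_name(df1, df2)
res
```

Selecting by name rather than by logical position means the column order inside each input does not matter.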
Here is a fun base R method with Reduce.
Reduce(cbind,
list(Reduce("+", list(df1[intersect(names(df1), names(df2))],
df2[intersect(names(df1), names(df2))])), # sum results
df1[setdiff(names(df1), names(df2))], # in df1, not df2
df2[setdiff(names(df2), names(df1))])) # in df2, not df1
This returns
A C D B
1 2 2 1 1
2 4 4 2 2
This assumes that both df1 and df2 have columns that are not present in the other. If this is not true, you'd have to adjust the list.
Note also that you could replace Reduce with do.call in both places and you'd get the same result.
Consider the following two tibbles:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
So a and b have the same columns and b has an additional column called id.
I want to do the following: group b by id and then add tibble a on top of each group.
So the output should look like this:
# A tibble: 10 x 3
id time value
<chr> <int> <int>
1 a -1 100
2 a 0 200
3 a 1 1
4 a 2 2
5 a 3 3
6 b -1 100
7 b 0 200
8 b 1 4
9 b 2 5
10 b 3 6
Of course there are multiple workarounds to achieve this (like loops for example). But in my case I have a large number of IDs and a very large number of columns.
I would be thankful if anyone could point me towards the direction of a solution within the tidyverse.
Thank you
We can expand the data frame a with id from b and then bind_rows them together.
library(tidyverse)
a2 <- expand(a, id = b$id, nesting(time, value))
b2 <- bind_rows(a2, b) %>% arrange(id, time)
b2
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
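For comparison, the expand step is a cross join of the distinct ids with a, which base R can do with merge(..., by = NULL). A minimal sketch using plain data frames (a2 and out are illustrative names):

```r
a <- data.frame(time = c(-1, 0), value = c(100, 200))
b <- data.frame(id = rep(letters[1:2], each = 3),
                time = rep(1:3, 2), value = 1:6)

# by = NULL asks merge() for the Cartesian product: every id with every row of a
a2 <- merge(data.frame(id = unique(b$id)), a, by = NULL)

# Stack with b, then order so a's rows sit on top of each group
out <- rbind(a2, b)
out <- out[order(out$id, out$time), ]
rownames(out) <- NULL
out
```

Ordering by time only puts the rows of a on top because their times are smaller than those in b. If that is not guaranteed, an explicit ordering column would be needed.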
split from base R will divide a data frame into a list of subsets based on an index.
b %>%
split(b[["id"]]) %>%
lapply(bind_rows, a) %>%
lapply(select, -"id") %>%
bind_rows(.id = "id")
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a -1 100
# 5 a 0 200
# 6 b 1 4
# 7 b 2 5
# 8 b 3 6
# 9 b -1 100
# 10 b 0 200
An idea (via base R) is to split your data frame and create a new one with id + the other data frame and rbind, i.e.
df = do.call(rbind, lapply(split(b, b$id), function(i)rbind(data.frame(id = i$id[1], a), i)))
which gives
id time value
a.1 a -1 100
a.2 a 0 200
a.3 a 1 1
a.4 a 2 2
a.5 a 3 3
b.1 b -1 100
b.2 b 0 200
b.3 b 1 4
b.4 b 2 5
b.5 b 3 6
NOTE: You can remove the rownames by simply calling rownames(df) <- NULL
We can nest and add the relevant rows to each nested item :
library(tidyverse)
b %>%
nest(-id) %>%
mutate(data= map(data,~bind_rows(a,.x))) %>%
unnest
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
Maybe not the most efficient way, but easy to follow:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
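b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
# Use nrow(a), not length(a): length() of a tibble counts columns and only
# happens to equal the number of rows here
a.a <- a %>% add_column(id = rep("a", nrow(a)))
a.b <- a %>% add_column(id = rep("b", nrow(a)))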
joint <- bind_rows(b,a.a,a.b)
(joint <- arrange(joint,id))
I would like to transform a list like this:
l <- list(x = c(1, 2), y = c(3, 4, 5))
into a tibble like this:
Name Value
x 1
x 2
y 3
y 4
y 5
I don't think anything is easier than using the stack function from base R:
df <- stack(l)
gives you a dataframe back:
> df
values ind
1 1 x
2 2 x
3 3 y
4 4 y
5 5 y
Because you asked for tibble as output, you can do as_tibble(df) (from the tibble-package) to get that.
Or more directly: df <- as_tibble(stack(l)).
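If the column names and order from the question (Name first, then Value) matter, the output of stack can be rearranged afterwards. A small base R sketch:

```r
l <- list(x = c(1, 2), y = c(3, 4, 5))

# stack() returns columns "values" and "ind"; reorder and rename them
df <- stack(l)[, c("ind", "values")]
names(df) <- c("Name", "Value")

# ind comes back as a factor; convert if plain character is preferred
df$Name <- as.character(df$Name)
df
```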
Another pure base R method:
df <- data.frame(ind = rep(names(l), lengths(l)), value = unlist(l), row.names = NULL)
which gives a similar result:
> df
ind value
1 x 1
2 x 2
3 y 3
4 y 4
5 y 5
The row.names = NULL isn't strictly needed, but it gives plain row numbers as row names instead of the names carried over from unlist(l) (x1, x2, y1, ...).
Update
I found a better solution.
This works both in case of simple and complicated lists like the one I posted before (below)
l %>% map_dfr(~ .x %>% as_tibble(), .id = "name")
gives us
# A tibble: 5 x 2
name value
<chr> <dbl>
1 x 1.
2 x 2.
3 y 3.
4 y 4.
5 y 5.
==============================================
Original answer
From tidyverse:
l %>%
map(~ as_tibble(.x)) %>%
map2(names(.), ~ add_column(.x, Name = rep(.y, nrow(.x)))) %>%
bind_rows()
gives us
# A tibble: 5 × 2
value Name
<dbl> <chr>
1 1 x
2 2 x
3 3 y
4 4 y
5 5 y
The stack function from base R is great for simple lists as Jaap showed.
However, with more complicated lists like:
l <- list(
a = list(num = 1:3, let_a = letters[1:3]),
b = list(num = 101:103, let_b = letters[4:6]),
c = list()
)
we get
stack(l)
values ind
1 1 a
2 2 a
3 3 b
4 a b
5 b a
6 c a
7 101 b
8 102 b
9 103 a
10 d a
11 e b
12 f b
which is wrong.
The tidyverse solution shown above works fine, keeping the data from different elements of the nested list separated:
# A tibble: 6 × 4
num let_a Name let_b
<int> <chr> <chr> <chr>
1 1 a a <NA>
2 2 b a <NA>
3 3 c a <NA>
4 101 <NA> b d
5 102 <NA> b e
6 103 <NA> b f
We can use melt from reshape2
library(reshape2)
melt(l)
# value L1
#1 1 x
#2 2 x
#3 3 y
#4 4 y
#5 5 y