R dplyr left join multiple tables without two separate columns with suffix

Suppose I have a main table x
x <- tibble(id = c(1,2,3,4,5), score = c(100,200,300,100,200))
x
# A tibble: 5 x 2
id score
<dbl> <dbl>
1 1 100
2 2 200
3 3 300
4 4 100
5 5 200
and two other tables
y = tibble(id = c(1,2), score_new=c(200,300))
y
# A tibble: 2 x 2
id score_new
<dbl> <dbl>
1 1 200
2 2 300
z = tibble(id = c(3,4), score_new = c(300,400))
z
# A tibble: 2 x 2
id score_new
<dbl> <dbl>
1 3 300
2 4 400
If I join them together it will be like this:
x %>% left_join(y, by =c("id" = "id")) %>% left_join(z, by =c("id" = "id"))
# A tibble: 5 x 4
id score score_new.x score_new.y
<dbl> <dbl> <dbl> <dbl>
1 1 100 200 NA
2 2 200 300 NA
3 3 300 NA 300
4 4 100 NA 400
5 5 200 NA NA
But I need score_new to be only one column. How do I do that? Sorry if there are already other similar questions but I really couldn't find them.

You can do that by appending y and z and then joining them.
# Loading required libraries
library(dplyr)
# Create sample df
x <- tibble(id = c(1,2,3,4,5), score = c(100,200,300,100,200))
y = tibble(id = c(1,2), score_new=c(200,300))
z = tibble(id = c(3,4), score_new = c(300,400))
x %>%
# union y and z and join on x to get new scores
left_join(union_all(y,z), by = "id")
Similarly, you can use bind_rows instead of union_all; both give the same result in this scenario.
x %>%
# union y and z and join on x to get new scores
left_join(bind_rows(y,z), by = "id")

I'm a bit late to the party, but I would opt for this tidyverse solution:
bind_rows(
y,z
) %>% left_join(x = x)
which gives the following output:
# A tibble: 5 x 3
id score score_new
<dbl> <dbl> <dbl>
1 1 100 200
2 2 200 300
3 3 300 300
4 4 100 400
5 5 200 NA
Note: left_join() has x and y arguments. Here I've specified x = x, so your main table stays on the left-hand side and the piped bind_rows() result is matched to the y argument.

You can try this approach:
x %>% left_join(y, by = "id") %>% left_join(z, by = "id") %>%
mutate(score_new.x = if_else(is.na(score_new.x), score_new.y, score_new.x)) %>%
select(-score_new.y)
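The if_else(is.na(...)) pattern above can also be written with dplyr's coalesce(), which returns the first non-missing value across its arguments. A minimal sketch, assuming the x, y and z tibbles from the question; the name joined is only illustrative:

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3, 4, 5), score = c(100, 200, 300, 100, 200))
y <- tibble(id = c(1, 2), score_new = c(200, 300))
z <- tibble(id = c(3, 4), score_new = c(300, 400))

joined <- x %>%
  left_join(y, by = "id") %>%
  left_join(z, by = "id") %>%   # yields score_new.x and score_new.y
  # take score_new.x where present, otherwise fall back to score_new.y
  mutate(score_new = coalesce(score_new.x, score_new.y)) %>%
  select(-score_new.x, -score_new.y)

joined
```

If y and z ever shared an id, this variant would keep the value from y, whereas appending the two tables first would duplicate that row in the join.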

Related

Adding column if it does not exist inside purrr language

I've been struggling to add a new column if it does not exist. I found the answer here: Adding column if it does not exist.
However, in my problem I must use it inside a purrr call. I tried to adapt the above answer, but it doesn't fit my needs.
Here is an example of what I'm dealing with:
Suppose I have a list of two data frames:
library(tibble)
A = tibble(
x = 1:5, y = 1, z = 2
)
B = tibble(
x = 5:1, y = 3, z = 3, w = 7
)
dt_list = list(A, B)
The column I'd like to add is w:
cols = c(w = NA_real_)
Separately, if I want to add a column only if it does not exist, I can do the following.
Since w already exists in B, no column is added:
B %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 5 3 3 7
2 4 3 3 7
3 3 3 3 7
4 2 3 3 7
5 1 3 3 7
In this case, since it does not exist, w is added:
A %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 1 1 2 NA
2 2 1 2 NA
3 3 1 2 NA
4 4 1 2 NA
5 5 1 2 NA
I tried the following to replicate it using purrr (I'd prefer not to use a for loop):
dt_list_2 = dt_list %>%
purrr::map(
~dplyr::select(., -starts_with("x")) %>%
~tibble::add_column(!!!cols[!names(cols) %in% names(.)])
)
But the output is not the same as doing it separately.
Note: This is an example of my real problem. In fact, I'm using purrr to read many *.csv files and then apply some data transformation. Something like this:
re_file <- list.files(path = dir_path, pattern = "*.csv")
cols_add = c(UCI = NA_real_)
file_list = re_file %>%
purrr::map(function(file_name){ # iterate through each file name
read_csv(file = paste0(dir_path, "//",file_name), skip = 2)
}) %>%
purrr::map(
~dplyr::select(., -starts_with("Textbox")) %>%
~dplyr::tibble(!!!cols[!names(cols) %in% names(.)])
)
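The second ~ in your attempt puts a new formula on the right-hand side of the pipe instead of continuing the pipeline. A single lambda with . passed explicitly as the data is enough. You can use: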
dt_list %>%
purrr::map(
~tibble::add_column(., !!!cols[!names(cols) %in% names(.)])
)
#[[1]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 1 1 2 NA
#2 2 1 2 NA
#3 3 1 2 NA
#4 4 1 2 NA
#5 5 1 2 NA
#[[2]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 5 3 3 7
#2 4 3 3 7
#3 3 3 3 7
#4 2 3 3 7
#5 1 3 3 7
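To keep the select() step from the question in the same map() call, one option is to move the "add if missing" logic into a small function. This is only a sketch: add_missing_cols is a hypothetical helper name, not part of tibble or purrr, and the data are the A and B tibbles from the question.

```r
library(tibble)
library(dplyr)
library(purrr)

A <- tibble(x = 1:5, y = 1, z = 2)
B <- tibble(x = 5:1, y = 3, z = 3, w = 7)
dt_list <- list(A, B)
cols <- c(w = NA_real_)

# Hypothetical helper: add only those entries of `cols` that `df` lacks
add_missing_cols <- function(df, cols) {
  to_add <- cols[!names(cols) %in% names(df)]
  tibble::add_column(df, !!!as.list(to_add))
}

dt_list_2 <- dt_list %>%
  map(~ .x %>%
        select(-starts_with("x")) %>%   # drop unwanted columns first
        add_missing_cols(cols))         # then add w where it is absent

dt_list_2
```

Naming the data argument inside the helper avoids any ambiguity about what . refers to once several pipes are nested in one lambda.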

R dplyr::Filter dataframe by group and numeric vector?

I have a dataframe df1 containing data and groups, and df2, which stores the same groups and one value per group.
I want to filter rows of df1 by df2, keeping the rows where the lag by group is higher than the indicated value.
Dummy example:
# identify the first year of disturbance by lag by group
df1 <- data.frame(year = c(1:4, 1:4),
mort = c(5,16,40,4,5,6,10,108),
distance = rep(c("a", "b"), each = 4))
df2 = data.frame(distance = c("a", "b"),
my.median = c(12,1))
Now calculate the lag between values (creates new column) and filter df1 based on column values of df2:
# calculate lag between years
df1 %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
filter(yearLag > df2$my.median) ##
This, however, does not produce the expected result:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 2 16 a 11
2 3 40 a 24
3 4 108 b 98
Instead, I expect to get:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 3 40 a 24
2 1 5 b 5
3 3 10 b 4
The filter works great when applied to a single value, but how can I adapt it to a vector, and especially to a vector of groups (as the order of elements can potentially change)?
Is this what you're trying to do?
df1 %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
left_join(df2, by = "distance") %>%
filter(yearLag > my.median)
Result:
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance yearLag my.median
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 24 12
2 1 5 b 5 1
3 3 10 b 4 1
4 4 108 b 98 1
Here is a data.table approach:
library( data.table )
#create data.tables
setDT(df1);setDT(df2)
#create yearLag variable
df1[, yearLag := mort - shift( mort, type = "lag", fill = 0 ), by = .(distance) ]
#update join and filter wanted rows
df1[ df2, median.value := i.my.median, on = .(distance)][ yearLag > median.value, ][]
# year mort distance yearLag median.value
# 1: 3 40 a 24 12
# 2: 1 5 b 5 1
# 3: 3 10 b 4 1
# 4: 4 108 b 98 1
Came to the same conclusion. You should left_join the data frames.
df1 %>% left_join(df2, by="distance") %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
filter(yearLag > my.median)
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance my.median yearLag
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 12 24
2 1 5 b 1 5
3 3 10 b 1 4
4 4 108 b 1 98
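For context on why the original filter misbehaves: yearLag > df2$my.median compares each group's rows against the whole vector c(12, 1), recycled position by position, rather than against that group's own value. The joins above fix this by matching on distance. As an alternative sketch on the same data, a named lookup vector gives an order-independent match without adding a column; the names thresholds and res are illustrative.

```r
library(dplyr)

df1 <- data.frame(year = c(1:4, 1:4),
                  mort = c(5, 16, 40, 4, 5, 6, 10, 108),
                  distance = rep(c("a", "b"), each = 4))
df2 <- data.frame(distance = c("a", "b"),
                  my.median = c(12, 1))

# Named vector: group label -> threshold, independent of row order in df2
thresholds <- setNames(df2$my.median, as.character(df2$distance))

res <- df1 %>%
  group_by(distance) %>%
  mutate(yearLag = mort - lag(mort, default = 0)) %>%
  # look up each row's threshold by its own group label
  filter(yearLag > unname(thresholds[as.character(distance)])) %>%
  ungroup()

res
```

This keeps the same four rows as the join-based answers.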

dplyr - How to obtain the order of one column within a group?

Example data:
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
> tibbly
# A tibble: 12 x 4
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 10 A X 1
2 30 A X 2
3 50 A X 3
4 10 A Y 4
5 30 A Y 4
6 50 A Y 6
7 10 B X 2
8 30 B X 5
9 50 B X 3
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
Question:
How can I obtain the order of rows for each group in a dataframe? I can use dplyr to arrange the data in an appropriate form to visualize what I am interested in:
> tibbly %>%
group_by(grouping1, grouping2) %>%
arrange(grouping1, grouping2, desc(value))
# A tibble: 12 x 4
# Groups: grouping1, grouping2 [4]
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 50 A X 3
2 30 A X 2
3 10 A X 1
4 50 A Y 6
5 10 A Y 4
6 30 A Y 4
7 30 B X 5
8 50 B X 3
9 10 B X 2
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
In the end I am interested in the order of the age column for each group, based on the value column. Is there an elegant way to do this with dplyr? Something like summarise(), but based on the order of rows and not the actual values?
library(dplyr)
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
tibbly %>%
group_by(grouping1, grouping2) %>% # for each group
arrange(desc(value)) %>% # arrange value descending
summarise(order = paste0(age, collapse = ",")) %>% # get the order of age as a string
ungroup() # forget the grouping
# # A tibble: 4 x 3
# grouping1 grouping2 order
# <chr> <chr> <chr>
# 1 A X 50,30,10
# 2 A Y 50,10,30
# 3 B X 30,50,10
# 4 B Y 10,30,50
With data.table
library(data.table)
setDT(tibbly)[order(-value), .(order = toString(age)),.(grouping1, grouping2)]
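The arrange-then-summarise idea can also be folded into a single summarise() call by indexing age with order(-value) inside each group. A sketch on the same example data; the name age_order is illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

tibbly <- tibble(age = c(10, 30, 50, 10, 30, 50, 10, 30, 50, 10, 30, 50),
                 grouping1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                 grouping2 = c("X", "X", "X", "Y", "Y", "Y", "X", "X", "X", "Y", "Y", "Y"),
                 value = c(1, 2, 3, 4, 4, 6, 2, 5, 3, 6, 3, 2))

age_order <- tibbly %>%
  group_by(grouping1, grouping2) %>%
  # order(-value) gives row positions within the group, largest value first
  summarise(order = paste(age[order(-value)], collapse = ","),
            .groups = "drop")

age_order
```

Because the ordering happens inside each group, the result does not depend on how the rows were sorted beforehand.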

List into tibble using list names as values in one column

I would like to transform a list like this:
l <- list(x = c(1, 2), y = c(3, 4, 5))
into a tibble like this:
Name Value
x 1
x 2
y 3
y 4
y 5
I think nothing will be easier than using the stack function from base R:
df <- stack(l)
gives you a dataframe back:
> df
values ind
1 1 x
2 2 x
3 3 y
4 4 y
5 5 y
Because you asked for tibble as output, you can do as_tibble(df) (from the tibble-package) to get that.
Or more directly: df <- as_tibble(stack(l)).
Another pure base R method:
df <- data.frame(ind = rep(names(l), lengths(l)), value = unlist(l), row.names = NULL)
which gives a similar result:
> df
ind value
1 x 1
2 x 2
3 y 3
4 y 4
5 y 5
The row.names = NULL isn't strictly needed, but it gives row numbers as row names.
Update
I found a better solution.
This works for both simple lists and more complicated ones like the one from my original answer (below):
l %>% map_dfr(~ .x %>% as_tibble(), .id = "name")
gives us
# A tibble: 5 x 2
name value
<chr> <dbl>
1 x 1.
2 x 2.
3 y 3.
4 y 4.
5 y 5.
==============================================
Original answer
From tidyverse:
l %>%
map(~ as_tibble(.x)) %>%
map2(names(.), ~ add_column(.x, Name = rep(.y, nrow(.x)))) %>%
bind_rows()
gives us
# A tibble: 5 × 2
value Name
<dbl> <chr>
1 1 x
2 2 x
3 3 y
4 4 y
5 5 y
The stack function from base R is great for simple lists as Jaap showed.
However, with more complicated lists like:
l <- list(
a = list(num = 1:3, let_a = letters[1:3]),
b = list(num = 101:103, let_b = letters[4:6]),
c = list()
)
we get
stack(l)
values ind
1 1 a
2 2 a
3 3 b
4 a b
5 b a
6 c a
7 101 b
8 102 b
9 103 a
10 d a
11 e b
12 f b
which is wrong.
The tidyverse solution shown above works fine, keeping the data from different elements of the nested list separated:
# A tibble: 6 × 4
num let Name lett
<int> <chr> <chr> <chr>
1 1 a a <NA>
2 2 b a <NA>
3 3 c a <NA>
4 101 <NA> b d
5 102 <NA> b e
6 103 <NA> b f
We can use melt from reshape2
library(reshape2)
melt(l)
# value L1
#1 1 x
#2 2 x
#3 3 y
#4 4 y
#5 5 y
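Another tidyverse route for the simple list in the question is tibble::enframe() followed by tidyr::unnest(). A sketch, assuming both packages are installed; the column names Name and Value are taken from the desired output in the question, and long is an illustrative name.

```r
library(tibble)
library(tidyr)

l <- list(x = c(1, 2), y = c(3, 4, 5))

# enframe() makes one row per list element, with a list-column of values;
# unnest() then expands that list-column to one row per value
long <- unnest(enframe(l, name = "Name", value = "Value"), cols = Value)

long
```

This returns a tibble directly, so no as_tibble() conversion is needed afterwards.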

how to create a variable based on lm in a regular mutate in dplyr?

Consider this simple example:
library(dplyr)
library(broom)
dataframe <- tibble(id = c(1,2,3,4,5,6),
group = c(1,1,1,2,2,2),
value = c(200,400,120,300,100,100))
# A tibble: 6 x 3
id group value
<dbl> <dbl> <dbl>
1 1 1 200
2 2 1 400
3 3 1 120
4 4 2 300
5 5 2 100
6 6 2 100
Here I want to group by group and create two columns.
One is the number of distinct values in value (I can use dplyr::n_distinct), the other is the constant term from a regression of value on the vector 1. That is, the output of
tidy(lm(data = dataframe, value ~ 1)) %>% select(estimate)
estimate
1 203.3333
The difficulty here is combining these two simple outputs into a single mutate statement that preserves the grouping.
I tried something like:
formula1 <- function(data, myvar){
tidy(lm(data = data, myvar ~ 1)) %>% select(estimate)
}
dataframe %>% group_by(group) %>%
mutate(distinct = n_distinct(value),
mean = formula1(., value))
but this does not work. What am I missing here?
Thanks!
This approach will work if you use pull in place of select. This extracts the single estimate value from the tidy output.
formula1 <- function(data, myvar){
tidy(lm(data = data, myvar ~ 1)) %>% pull(estimate)
}
dataframe %>%
group_by(group) %>%
mutate(distinct = n_distinct(value),
mean = formula1(., value))
# A tibble: 6 x 5
# Groups: group [2]
id group value distinct mean
<dbl> <dbl> <dbl> <int> <dbl>
1 1 1 200 3 240.0000
2 2 1 400 3 240.0000
3 3 1 120 3 240.0000
4 4 2 300 2 166.6667
5 5 2 100 2 166.6667
6 6 2 100 2 166.6667
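Since the model has only an intercept, the helper function and the . argument can be skipped by calling lm() directly inside mutate(), where value already refers to the current group's rows. A sketch on the same example data, using coef() from base R instead of broom; res is an illustrative name.

```r
library(dplyr)

dataframe <- tibble(id = c(1, 2, 3, 4, 5, 6),
                    group = c(1, 1, 1, 2, 2, 2),
                    value = c(200, 400, 120, 300, 100, 100))

res <- dataframe %>%
  group_by(group) %>%
  mutate(distinct = n_distinct(value),
         # intercept of value ~ 1, fitted on the current group's rows only
         mean = unname(coef(lm(value ~ 1))[1])) %>%
  ungroup()

res
```

For an intercept-only regression the estimate equals the group mean, so mean(value) would give the same numbers; the lm() form is mainly useful when the formula later gains predictors.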
