Following a question I came across today, I would like to know how to use the bind_rows function in a pipe while avoiding duplicated rows and NA values. Suppose I have the following simple tibble:
df <- tibble(
col1 = c(3, 4, 5),
col2 = c(5, 3, 1),
col3 = c(6, 4, 9),
col4 = c(9, 6, 5)
)
I would like to bind col1 & col2 row-wise with col3 & col4 so that I end up with a tibble with 2 columns and 6 observations, and finally rename the columns to colnew1 and colnew2.
But when I use bind_rows I get the following output, with a lot of duplication and NA values:
df %>%
bind_rows(
select(., 1:2),
select(., 3:4)
)
# A tibble: 9 x 4
col1 col2 col3 col4
<dbl> <dbl> <dbl> <dbl>
1 3 5 6 9
2 4 3 4 6
3 5 1 9 5
4 3 5 NA NA
5 4 3 NA NA
6 5 1 NA NA
7 NA NA 6 9
8 NA NA 4 6
9 NA NA 9 5
# My desired output would be something like this:
f1 <- function(x) {
df <- x %>%
set_names(nm = rep(c("newcol1", "newcol2"), 2))
bind_rows(df[, c(1, 2)], df[, c(3, 4)])
}
f1(df)
# A tibble: 6 x 2
newcol1 newcol2
<dbl> <dbl>
1 3 5
2 4 3
3 5 1
4 6 9
5 4 6
6 9 5
I can get the desired output without a pipe, but I would like to know, first, how to use bind_rows in a pipe without getting NA values and duplicated rows and, second, whether I can use the select function inside bind_rows; I remember Hadley Wickham once wrapping filter calls in bind_rows.
I would appreciate any explanation of this problem. Thank you in advance.
Select the first two columns, then bind_rows them with col3 and col4 renamed to col1 and col2 via transmute:
df1 <- df %>%
select(col1, col2) %>%
bind_rows(
df %>%
transmute(col1 = col3, col2 = col4)
)
Results:
# A tibble: 6 x 2
col1 col2
<dbl> <dbl>
1 3 5
2 4 3
3 5 1
4 6 9
5 4 6
6 9 5
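As for why the original pipe produced nine rows: when the dot appears only inside nested calls, magrittr still passes the left-hand side as the first argument, so bind_rows received df itself plus the two selections, and the mismatched column names then produced the NA values. A minimal sketch of one workaround, assuming magrittr's brace syntax (which suppresses the first-argument insertion) and using the colnew1/colnew2 names from the question:

```r
library(dplyr)

df <- tibble(
  col1 = c(3, 4, 5),
  col2 = c(5, 3, 1),
  col3 = c(6, 4, 9),
  col4 = c(9, 6, 5)
)

result <- df %>%
  {
    # Inside braces the dot is NOT inserted as the first argument,
    # so bind_rows only sees the two selections below.
    bind_rows(
      select(., colnew1 = col1, colnew2 = col2),  # select() can rename on the fly
      select(., colnew1 = col3, colnew2 = col4)
    )
  }

result
```

Because both selections share the same column names, bind_rows stacks them into 6 rows and 2 columns without padding anything with NA.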
I have a df
df = data.frame(col1=1:4, col2 = 5:8, col3 = 9:12)
I want to change the value in row 2 of col2 to 44.
In base R I use df["2","col2"] = 44. How do I do this in the tidyverse?
df = data.frame(col1=1:4, col2 = 5:8, col3 = 9:12)
df
df["2","col2"]=44
df
A possible solution:
library(dplyr)
df %>%
mutate(col2 = ifelse(row_number() == 2, 44, col2))
#> col1 col2 col3
#> 1 1 5 9
#> 2 2 44 10
#> 3 3 7 11
#> 4 4 8 12
You could maybe use the rows_update() function from dplyr?
rows_update(df, tibble(col1=2, col2=44))
# Matching, by = "col1"
# A tibble: 4 x 3
col1 col2 col3
<int> <int> <int>
1 1 5 9
2 2 44 10
3 3 7 11
4 4 8 12
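Note that rows_update() matches on key values (col1 == 2 here), not on row position; it only hits row 2 because col1 happens to equal the row number in this data. A hedged sketch contrasting a position-based update via base R's replace() with a key-based update, assuming dplyr >= 1.0.0 for rows_update():

```r
library(dplyr)

df <- data.frame(col1 = 1:4, col2 = 5:8, col3 = 9:12)

# Position-based: overwrite the 2nd element of col2, whatever col1 contains.
by_position <- df %>%
  mutate(col2 = replace(col2, 2, 44L))

# Key-based: update the row whose col1 equals 2; `by` is spelled out
# to avoid the "Matching, by = ..." message.
by_key <- rows_update(df, data.frame(col1 = 2L, col2 = 44L), by = "col1")

by_position
by_key
```

The integer literals (2L, 44L) are used only to keep the column types identical to the integer columns in df.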
I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
col1 col2
1 1 A
2 2 B
3 3 C
4 4 A
5 5 B
6 6 C
7 7 A
8 8 B
9 9 C
10 10 A
11 11 B
12 12 C
Subsetting to select every 2nd row should yield my_new_df as,
col1 col2
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
I tried in dplyr:
my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df
I get an error:
Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.
To see if the nrow function was a problem, I tried using the number directly. So,
my_df %>% group_by(col2) %>%
my_df[seq(2, 4, 2), ] -> my_new_df
Also gave an error,
Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.
My expectation was that it would run the subsetting on each group of data and then combine the results into 'my_new_df'. My understanding of how group_by works is clearly wrong, but I am stuck on how to move past this error. Any help would be much appreciated.
Try slice(), which takes row positions within each group (n() is the group size):
my_df %>%
group_by(col2)%>%
slice(seq(from = 2, to = n(), by = 2))
# A tibble: 6 x 2
# Groups: col2 [3]
col1 col2
<int> <chr>
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
You might want to ungroup after slicing if you want to do other operations not based on col2.
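One caveat: if a group could contain a single row, seq(from = 2, to = n(), by = 2) would raise an error for that group, since seq() cannot step upward from 2 to 1. A hedged alternative sketch using filter() with row_number(), followed by ungroup() and an arrange() that is there only to reproduce the layout of the desired output:

```r
library(dplyr)

my_df <- data.frame(col1 = 1:12, col2 = rep(c("A", "B", "C"), 4))

my_new_df <- my_df %>%
  group_by(col2) %>%
  filter(row_number() %% 2 == 0) %>%  # keep every 2nd row within each group
  ungroup() %>%
  arrange(col2, col1)                 # sort by group to match the desired layout

my_new_df
```

Changing the modulus (for example `%% 3 == 0`) selects every 3rd row per group instead.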
Here is a data.table option:
library(data.table)
data <- as.data.table(my_df)
data[(rowid(col2) %% 2) == 0]
col1 col2
1: 4 A
2: 5 B
3: 6 C
4: 10 A
5: 11 B
6: 12 C
Or base R:
my_df[as.logical(with(my_df, ave(col1, col2, FUN = function(x)
seq_along(x) %% 2 == 0))), ]
col1 col2
4 4 A
5 5 B
6 6 C
10 10 A
11 11 B
12 12 C
I have a list with two dataframes, the first of which has two columns and the second of which has three.
dat.list<-list(dat1=data.frame(col1=c(1,2,3),
col2=c(10,20,30)),
dat2= data.frame(col1=c(5,6,7),
col2=c(30,40,50),
col3=c(7,8,9)))
# $dat1
# col1 col2
# 1 1 10
# 2 2 20
# 3 3 30
# $dat2
# col1 col2 col3
# 1 5 30 7
# 2 6 40 8
# 3 7 50 9
I am trying to create a new column in both dataframes using map(), mutate() and case_when(). I want this new column to be identical to col3 if the dataframe has more than two columns, and identical to col1 if it has two or fewer columns. I have tried to do this with the following code:
library(tidyverse)
dat.list %>% map(~ .x %>%
mutate(newcol=case_when(ncol(.)>2 ~ col3,
TRUE ~ col1),
))
However, this returns the following error: "object 'col3' not found". How can I get the desired output? Below is the exact output I am trying to achieve.
# $dat1
# col1 col2 newcol
# 1 1 10 1
# 2 2 20 2
# 3 3 30 3
# $dat2
# col1 col2 col3 newcol
# 1 5 30 7 7
# 2 6 40 8 8
# 3 7 50 9 9
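if/else will do. case_when() evaluates every right-hand side regardless of the condition, so col3 is looked up even in the data frame that lacks it; a plain if/else evaluates only the branch it takes: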
library(dplyr)
library(purrr)
dat.list %>% map(~ .x %>% mutate(newcol= if(ncol(.) > 2) col3 else col1))
#$dat1
# col1 col2 newcol
#1 1 10 1
#2 2 20 2
#3 3 30 3
#$dat2
# col1 col2 col3 newcol
#1 5 30 7 7
#2 6 40 8 8
#3 7 50 9 9
Base R using lapply :
lapply(dat.list, function(x) transform(x, newcol = if(ncol(x) > 2) col3 else col1))
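The ncol() test works here only because the extra column happens to be col3. A sketch that checks for the column by name instead, assuming that is the real intent (the `src` helper variable is illustrative, not part of the original answers):

```r
library(dplyr)
library(purrr)

dat.list <- list(
  dat1 = data.frame(col1 = c(1, 2, 3), col2 = c(10, 20, 30)),
  dat2 = data.frame(col1 = c(5, 6, 7), col2 = c(30, 40, 50), col3 = c(7, 8, 9))
)

out <- map(dat.list, function(d) {
  # Pick the source column by name, falling back to col1 when col3 is absent.
  src <- if ("col3" %in% names(d)) "col3" else "col1"
  mutate(d, newcol = .data[[src]])
})

out
```

The `.data[[src]]` pronoun lets mutate() look up a column whose name is stored in a string.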
I have three dataframes like below:
df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','B','C','E','I'),col2=c(4,6,8,2,9))
df1 <- data.frame(col1=c('A','D','C','E','I'),col2=c(4,7,8,2,9))
The differences between any two files could be as below:
anti_join(df2, df3)
# Joining, by = c("col1", "col2")
# col1 col2
# 1 B 6
# 2 I 9
anti_join(df3, df2)
# Joining, by = c("col1", "col2")
# [1] col1 col2
# <0 rows> (or 0-length row.names)
anti_join(df1, df2)
# Joining, by = c("col1", "col2")
# col1 col2
# 1 D 7
anti_join(df2, df1)
# Joining, by = c("col1", "col2")
# col1 col2
# 1 B 6
I would like to create a master dataframe with all the values in col1 and col2 specific to each dataframe. If there is no such value present, it should populate NA.
col1 df1_col2 df2_col2 df3_col2
1 A 4 4 4
2 B NA 6 NA
3 C 8 8 8
4 E 2 2 2
5 I 9 9 NA
6 D 7 NA NA
The essence of the above output could be established from the above anti_join commands. However, it does not provide the complete picture at once. Any thoughts on how to achieve this?
Edit: For multiple values in col2 for col1, the output is a little messier. For example, A has values 4, 3.
df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','A','B','C','E','I'),col2=c(4,3,6,8,2,9))
df1 <- data.frame(col1=c('A','A','D','C','E','I'),col2=c(4,3,7,8,2,9))
lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
lst_of_frames %>%
imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
reduce(full_join, by = "col1")
It gives the below output.
# col1 df1_col2 df2_col2 df3_col2
# 1 A 4 4 4
# 2 A 4 3 4
# 3 A 3 4 4
# 4 A 3 3 4
# 5 D 7 NA NA
# 6 C 8 8 8
# 7 E 2 2 2
# 8 I 9 9 NA
# 9 B NA 6 NA
The interesting part of the output is:
# col1 df1_col2 df2_col2 df3_col2
# 1 A 4 4 4
# 2 A 4 3 4
# 3 A 3 4 4
# 4 A 3 3 4
whereas the expected output is:
# col1 df1_col2 df2_col2 df3_col2
# 1 A 4 4 4
# 2 A 3 3 NA
You may use the full_join function from the dplyr package.
df_master <- df1 %>%
full_join(df2, by = "col1") %>%
full_join(df3, by = "col1") %>%
select(col1, df1_col2 = col2.x,
df2_col2 = col2.y,
df3_col2 = col2)
col1 df1_col2 df2_col2 df3_col2
1 A 4 4 4
2 D 7 NA NA
3 C 8 8 8
4 E 2 2 2
5 I 9 9 NA
6 B NA 6 NA
Similar to @tamtam's answer, but a little more programmatic if you have a dynamic list of frames.
lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
# lst_of_frames <- tibble::lst(df1, df2, df3) # thanks, @user63230
library(dplyr)
library(purrr) # imap, reduce
lst_of_frames %>%
imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
reduce(full_join, by = "col1")
# col1 df1_col2 df2_col2 df3_col2
# 1 A 4 4 4
# 2 D 7 NA NA
# 3 C 8 8 8
# 4 E 2 2 2
# 5 I 9 9 NA
# 6 B NA 6 NA
It's important (for automatically renaming the columns) that the list-of-frames be a named list; my assumption was the name of the frame variable list(df1=df1), but it could just as easily be list(A=df1) to produce a column named A_col2 in the end.
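For the case in the question's edit, where col1 has several col2 values, joining on col1 alone pairs every A row with every other A row. One possible sketch, under the assumption that rows should line up only when both col1 and the col2 value agree: copy col2 into a per-frame column, join on both keys, then drop the shared col2.

```r
library(dplyr)
library(purrr)

df3 <- data.frame(col1 = c("A", "C", "E"), col2 = c(4, 8, 2))
df2 <- data.frame(col1 = c("A", "A", "B", "C", "E", "I"), col2 = c(4, 3, 6, 8, 2, 9))
df1 <- data.frame(col1 = c("A", "A", "D", "C", "E", "I"), col2 = c(4, 3, 7, 8, 2, 9))
lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)

master <- lst_of_frames %>%
  imap(function(d, nm) {
    # Keep col2 as a join key and copy it into e.g. df1_col2.
    d[[paste0(nm, "_col2")]] <- d$col2
    d
  }) %>%
  reduce(full_join, by = c("col1", "col2")) %>%
  select(-col2)

master
```

With this approach two frames that hold different values for the same col1 end up on separate rows, which may or may not be what is wanted.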
I'm merging two data frames as follows:
data_merged <- full_join(df1, df2, by=c("col1","col2")) %>%
fill(everything(), .direction = 'down')
However, there is a column in the new merged data frame that I don't want to fill (say, col3). This column needs to retain its NA values. I've tried doing this with select but failed, and also thought of working around it by making part of it a tibble, but couldn't make that idea work.
Does anybody have any ideas?
Try this:
library(tidyr) # fill(); also re-exports %>%
data.frame(col1 = 1:10, col2 = c(1, NA), col3 = c(2, NA)) %>%
fill(!col3, .direction = 'down')
# col1 col2 col3
# 1 1 1 2
# 2 2 1 NA
# 3 3 1 2
# 4 4 1 NA
# 5 5 1 2
# 6 6 1 NA
# 7 7 1 2
# 8 8 1 NA
# 9 9 1 2
# 10 10 1 NA
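Applied to the merge from the question, the same selection slots straight into the pipe. A minimal sketch with hypothetical inputs (df_a, df_b and the col4 column are made up for illustration; only col1, col2 and col3 are named in the question):

```r
library(dplyr)
library(tidyr)

# Hypothetical inputs sharing the join keys col1 and col2.
df_a <- data.frame(col1 = 1:4, col2 = c("a", "b", "c", "d"), col3 = c(10, NA, 30, NA))
df_b <- data.frame(col1 = 1:4, col2 = c("a", "b", "c", "d"), col4 = c(1, NA, NA, 4))

data_merged <- full_join(df_a, df_b, by = c("col1", "col2")) %>%
  fill(!col3, .direction = "down")  # fill every column except col3

data_merged
```

Here col4 is filled downward while col3 keeps its NA values.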
We can also use na.locf0 from zoo, applying it only to the columns that should be filled (col2 here) and leaving col3 untouched:
library(zoo)
df1$col2 <- na.locf0(df1$col2)
Data:
df1 <- data.frame(col1 = 1:10, col2 = c(1, NA), col3 = c(2, NA))