left_join two data frames and overwrite

I'd like to merge two data frames so that, for matching keys, the values from df2 overwrite those in df1 (whether the df1 values are NA or not). Merge data frames and overwrite values provides a data.table option, but I'd like to know if there is a way to do this with dplyr. I've tried all of the _join options but none seem to do this.
Here is an example:
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1, 2, NA, 4))
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
Desired output:
y x1
1 A 5
2 B 6
3 C 7
4 D 4

I think what you want is to keep the values of df2 and only add the rows of df1 that are not present in df2, which is exactly what anti_join does:
"anti_join return all rows from x where there are not matching values in y, keeping just columns from x."
My solution:
df3 <- anti_join(df1, df2, by = "y") %>% bind_rows(df2)
Warning messages:
1: In anti_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
2: In rbind_all(x, .id) : Unequal factor levels: coercing to character
> df3
Source: local data frame [4 x 2]
y x1
(chr) (dbl)
1 D 4
2 A 5
3 B 6
4 C 7
This line gives the desired output (in a different order), but pay attention to the warning messages: when working with your own dataset, make sure y is read as a character variable rather than a factor.
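As a side note, newer versions of dplyr (1.0.0 and up) provide rows_upsert(), which expresses this update-or-insert operation directly. A minimal sketch, assuming y uniquely identifies rows in both frames:
library(dplyr)

df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1, 2, NA, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7),
                  stringsAsFactors = FALSE)

# Matching rows of df1 are overwritten by df2; unmatched rows of df1 are kept
rows_upsert(df1, df2, by = "y")
#>   y x1
#> 1 A  5
#> 2 B  6
#> 3 C  7
#> 4 D  4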

This is the idiom I now use, since in addition it handles keeping columns that are not part of the update table. I use somewhat different names than the OP, but the flavor is similar.
The one thing I do is create a variable for the key columns used in the join, as I use it in a few spots. But otherwise, it does what is desired.
By itself it doesn't handle conditions such as "update this row only if the value is NA"; you should apply that condition when creating the join table.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
.keys <- c("key1", "key2")

.base_table <- tribble(
  ~key1, ~key2, ~val1, ~val2,
  "A",   "a",   0,     0,
  "A",   "b",   0,     1,
  "B",   "a",   1,     0,
  "B",   "b",   1,     1
)

.join_table <- tribble(
  ~key1, ~key2, ~val2,
  "A",   "b",   100,
  "B",   "a",   111
)
# This works
df_result <- .base_table %>%
  # Pull off the rows of the base table that match the join table
  semi_join(.join_table, by = .keys) %>%
  # Drop cols from the base table that are in the join table, except for the key columns
  select(-matches(setdiff(names(.join_table), .keys))) %>%
  # Left join on the join table columns
  left_join(.join_table, by = .keys) %>%
  # Bind back on the rows of the base table that have no match in the join table
  bind_rows(.base_table %>% anti_join(.join_table, by = .keys))
df_result %>%
print()
#> # A tibble: 4 x 4
#> key1 key2 val1 val2
#> <chr> <chr> <dbl> <dbl>
#> 1 A b 0 100
#> 2 B a 1 111
#> 3 A a 0 0
#> 4 B b 1 1
Created on 2019-12-12 by the reprex package (v0.3.0)
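For reuse, the same idiom can be wrapped in a small helper. A sketch of my own (the function name and arguments are made up, and I've swapped matches() for the stricter any_of()):
library(dplyr)

# Overwrite rows of `base` with matching rows of `update`, joining on `keys`;
# columns of `base` that are absent from `update` are kept as-is.
upsert_by_keys <- function(base, update, keys) {
  base %>%
    semi_join(update, by = keys) %>%
    select(-any_of(setdiff(names(update), keys))) %>%
    left_join(update, by = keys) %>%
    bind_rows(anti_join(base, update, by = keys))
}

upsert_by_keys(.base_table, .join_table, .keys)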

Related

Convert every n rows to columns and stack them in R?

I have a tab-delimited text file with a series of timestamped data. I've read it into R using read.delim() and it gives me all the data as characters in a single column. Example:
df <- data.frame(c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
colnames(df) <- "col1"
df
I want to convert every n rows (in this case 4) to columns and stack them without using a for loop. Desired result:
col1 <- c("2017","2018","2018")
col2 <- c("A","X","X")
col3 <- c("B","Y","B")
col4 <- c("C","Z","C")
df2 <- data.frame(col1, col2, col3, col4)
df2
I created a for loop, but it can't handle the millions of rows in my df. Should I convert to a matrix? Would converting to a list help? I tried as.matrix(read.table()) and unlist() but without success.
You could use tidyr to reshape the data into the form you want; you first need to mutate the data to add indexes identifying which row each value belongs to and which column it should go in.
Assuming you know there are 4 values per group (n = 4), you could do something like the following with the help of the dplyr package.
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n <- 4

df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")) %>%
  mutate(cols = rep(1:n, n() / n),
         id = rep(1:(n() / n), each = n))

pivot_wider(df, id_cols = id, names_from = cols, values_from = x, names_prefix = "cols")
#> # A tibble: 3 × 5
#> id cols1 cols2 cols3 cols4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 2017 A B C
#> 2 2 2018 X Y Z
#> 3 3 2018 X B C
Or, in base R, you could use the split() function on the vector and then do.call() to make the data frame:
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
split_df <- setNames(split(df$x, rep(1:4, 3)), paste0("cols", 1:4))
do.call("data.frame", split_df)
#> cols1 cols2 cols3 cols4
#> 1 2017 A B C
#> 2 2018 X Y Z
#> 3 2018 X B C
Created on 2022-02-01 by the reprex package (v2.0.1)
The easiest way would be to create a matrix with matrix(ncol = n, byrow = TRUE), then convert back to a data.frame. It should be quite fast, too.
df |>
unlist() |>
matrix(ncol=4, byrow = TRUE) |>
as.data.frame() |>
setNames(paste0('col', 1:4))
col1 col2 col3 col4
1 2017 A B C
2 2018 X Y Z
3 2018 X B C
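One caveat that applies to all three approaches, and is worth checking with millions of rows: they assume the number of values is an exact multiple of n. matrix() only warns when the length doesn't divide evenly (and the rep()-based indexing errors later with a confusing message), so a fail-fast guard of my own:
n <- 4
# Stop early if the last group would be incomplete
if (nrow(df) %% n != 0) {
  stop("nrow(df) is not a multiple of n")
}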

Find and keep duplicated items in each column in R

Is there any way I can use something like tidyverse's add_count() %>% filter() or distinct(), or alternatively janitor's get_dupes(), to find and keep the duplicated items in each column? There is no need to compare items across columns; each column should be considered separately.
data1 <- tribble(
  ~colA, ~colB,
  "a", 1,
  "b", 1,
  "c", 2,
  "c", 3
)
Expected output would be:
colA colB
c 1
You can try map_dfc(), which will map over the columns and return a data frame by column-binding the outputs:
library(tidyverse)
data1 %>%
  map_dfc(~ .x[duplicated(.x)])
# A tibble: 1 x 2
colA colB
<chr> <dbl>
1 c 1
However, this results in unwanted behavior when the columns contain different numbers of duplicates, due to recycling: when an operation such as column-binding requires two vectors to be the same length, R automatically repeats the shorter one until it matches the longer one.
data1 <- tribble(
  ~colA, ~colB,
  "a", 1,
  "b", 1,
  "c", 2,
  "c", 3,
  "d", 1
)
data1 %>%
  map_dfc(~ .x[duplicated(.x)])
# A tibble: 2 x 2
colA colB
<chr> <dbl>
1 c 1
2 c 1
Here colA has been recycled to match the length of colB. In such a case you are better off returning a list with map():
data1 %>%
  map(~ .x[duplicated(.x)])
# output
$colA
[1] "c"
$colB
[1] 1 1
In base R (note that unique() here also collapses repeated duplicates into a single entry):
duplicatedList <- lapply(data1, function(columnValues) {
  unique(columnValues[duplicated(columnValues)])
})
A base R option:
> list2DF(Map(function(x) x[duplicated(x)], data1))
colA colB
1 c 1
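Since the question mentions janitor, the same column-wise idea also works with get_dupes() if you wrap each column in a one-column tibble first. A sketch under that assumption:
library(dplyr)
library(purrr)
library(janitor)

# Returns a list with one tibble per column, each carrying a dupe_count column
data1 %>%
  map(~ get_dupes(tibble(value = .x), value))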

Subset a data frame based on the count of values in column x, keeping only the top two in R

Here is the data frame:
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame such that I get a new data frame with only the top two "x" values, based on the count of "x".
One of the simplest ways to achieve this is with the package data.table. You can read more about it here. Basically, it allows fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and "c" to p and x, respectively. This way, you won't see an NA when filtering the top two observations per group.
The idea is to sort your dataset and then use .SD, which is a convenient way of subsetting/filtering/extracting observations within groups.
Please see the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#> x p
#> 1: a 45
#> 2: a 6
#> 3: b 6
#> 4: b 3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
  add_count(x) %>%
  slice_max(n, n = 2)
p x n
1 1 a 4
2 3 b 4
3 45 a 4
4 1 a 4
5 1 b 4
6 6 a 4
7 6 b 4
8 2 b 4
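If you don't want the helper column n in the result, a variation of my own is to filter directly on the two most frequent values of x (note that ties at the boundary are broken arbitrarily by sort()):
library(dplyr)

df %>%
  filter(x %in% names(sort(table(x), decreasing = TRUE))[1:2])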

How to anonymize data without losing duplicates

I need to anonymize data containing client numbers. About half of them are duplicate values, as these clients appear more than once.
How can I anonymize them in R so that duplicates are transformed into the same value?
Thanks in advance!
Suppose your data looks like this:
df <- data.frame(id = c("A", "B", "C", "A", "B", "C"), value = rnorm(6),
                 stringsAsFactors = FALSE)
df
#> id value
#> 1 A -0.8238857
#> 2 B -0.1553338
#> 3 C -0.6297834
#> 4 A -0.4616377
#> 5 B 0.1643057
#> 6 C -0.6719061
And your list of new ID strings (which can be created randomly - see footnote) looks like this:
newIds <- c("newId1", "newId2", "newId3")
Then you should first ensure that your id column is a factor:
df$id <- as.factor(df$id)
Then you should probably store the mapping from new to old client IDs for safe lookup later:
lookup <- data.frame(key = newIds, value = levels(df$id))
lookup
#> key value
#> 1 newId1 A
#> 2 newId2 B
#> 3 newId3 C
Now all you need to do is overwrite the factor levels:
levels(df$id) <- newIds
df
#> id value
#> 1 newId1 0.7241847
#> 2 newId2 0.4313706
#> 3 newId3 -0.8687062
#> 4 newId1 1.3464852
#> 5 newId2 0.6973432
#> 6 newId3 1.9872338
Note: If you want to create random strings for the ids you can do this:
sapply(seq_along(levels(df$id)), function(x) paste0(sample(LETTERS, 5), collapse = ""))
#> [1] "TWABF" "YSBUF" "WVQEY"
Created on 2020-03-02 by the reprex package (v0.3.0)
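One caveat to add: with many clients, independently sampled 5-letter strings can collide. A sketch of my own that guarantees unique labels by permuting an index (the "client" prefix is made up):
ids <- levels(df$id)
# sample() shuffles the indices so the new labels don't reveal the original order
newIds <- sprintf("client%04d", sample(seq_along(ids)))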

Reshape a dataframe in R with non numeric values

I have a data frame with non-numeric values with the following format:
DF1:
col1 col2
1 a b
2 a c
3 z y
4 z x
5 a d
6 m n
I need to convert it into this format,
DF2:
col1 col2 col3 col4
1 a b c d
2 z y x NA
3 m n NA NA
With col1 as the primary key (not sure if this is the right terminology in R), the rest of the columns contain the elements associated with that key (as seen in DF1).
DF2 will include more columns than DF1, depending upon the number of elements associated with each key.
Some cells will have no value because keys have different numbers of elements; these are represented as NA (as shown in DF2).
The column names could be anything.
I have tried reshape(), melt() + cast(), and even a generic for loop where I use cbind and try to delete rows.
It is part of a very big dataset with over 50 million rows. I might have to use cloud services for this task but that is a different discussion.
I am new to R so there might be some obvious solution which I am missing.
Any help would be much appreciated.
-Thanks
If this is a big dataset, we can use data.table:
library(data.table)
setDT(DF1)[, i1 := paste0("col", seq_len(.N) + 1L), col1]
dcast(DF1, col1 ~ i1, value.var = "col2")
# col1 col2 col3 col4
#1: a b c d
#2: m n NA NA
#3: z y x NA
Using dplyr and tidyr:
library(tidyr)
library(dplyr)

DF <- data_frame(col1 = c("a", "a", "z", "z", "a", "m"),
                 col2 = c("b", "c", "y", "x", "d", "n"))

# you need another column to use as the key for spreading
DF %>%
  group_by(col1) %>%
  mutate(colname = paste0("col", 1:n() + 1)) %>%
  spread(colname, col2)
#> Source: local data frame [3 x 4]
#> Groups: col1 [3]
#>
#> col1 col2 col3 col4
#> (chr) (chr) (chr) (chr)
#> 1 a b c d
#> 2 m n NA NA
#> 3 z y x NA
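As a side note, spread() has since been superseded by pivot_wider(); an equivalent sketch with the newer verb:
library(dplyr)
library(tidyr)

DF %>%
  group_by(col1) %>%
  mutate(colname = paste0("col", row_number() + 1)) %>%
  pivot_wider(names_from = colname, values_from = col2) %>%
  ungroup()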
