R Join without duplicates

Currently, when joining two datasets (from different years), I get duplicated rows from the second dataset when it has fewer observations than the first.
Below, ID 1 has only one observation in year y, but it gets repeated because the dataset for year x has three observations for that ID. I don't want the duplicates, just NAs.
So what I currently get is this:
ID Value.x N.x Value.y N.y
<dbl> <chr> <dbl> <chr> <dbl>
1 1 A 6 A 2
2 1 B 7 A 2
3 1 C 1 A 2
What I want is:
ID Value.x N.x Value.y N.y
<dbl> <chr> <dbl> <chr> <dbl>
1 1 A 6 A 2
2 1 B 7 NA NA
3 1 C 1 NA NA
The end result is that my manager can tell in year x customer 1 ordered A, B, C in n.x quantities. In year y they only ordered A in n.y quantities.
Data:
tbl_df1 <- structure(list(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6,
7, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
tbl_df2 <- structure(list(ID = 1, Value = "A", N = 2), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -1L))

I would join on both ID and Value and keep all rows, so that values missing from the second dataset become NA:
merge(tbl_df1, tbl_df2, by = c("ID", "Value"), all = TRUE)
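For readers who prefer dplyr, the same idea can probably be expressed with full_join(). This is only a minimal sketch: it assumes the two tables are the ones from the question's Data section, rebuilt here with tibble() so the block runs on its own, and it reuses the names tbl_df1 and tbl_df2 from the merge() call above.

```r
library(dplyr)

# Data from the question, rebuilt so the example is self-contained
tbl_df1 <- tibble(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6, 7, 1))
tbl_df2 <- tibble(ID = 1, Value = "A", N = 2)

# Join on both keys and keep every row from either table;
# rows with no match in tbl_df2 get NA in N.y
res <- full_join(tbl_df1, tbl_df2, by = c("ID", "Value"), suffix = c(".x", ".y"))
res
```

Because Value is part of the join key, the result has a single Value column plus N.x and N.y, rather than separate Value.x and Value.y columns.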

Related

r: combine two separate values on same row based on duplicate values in a column [duplicate]

Let's say I have
# A tibble: 4 × 3
Gene.names Case Control
<chr> <dbl> <dbl>
1 A1BG 52 NA
2 A1BG NA 32
3 A2M 16 NA
4 A2M NA 15
As you can see, the Gene.names values are duplicated, and each duplicate carries a value for either Case or Control. I need to combine the Case and Control values so they are printed on the same row for each Gene.names value.
I am looking for a solution in dplyr.
Expected output
Gene.names Case Control
<chr> <dbl> <dbl>
1 A1BG 52 32
2 A2M 16 15
Data
df <- structure(list(Gene.names = c("A1BG", "A1BG", "A2M", "A2M"),
Case = c(52, NA, 16, NA), Control = c(NA, 32, NA, 15)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
A combination of pivot_longer and pivot_wider will do this.
library(tidyverse)
df <- structure(list(
Gene.names = c("A1BG", "A1BG", "A2M", "A2M"),
Case = c(52, NA, 16, NA), Control = c(NA, 32, NA, 15)
), row.names = c(
NA,
-4L
), class = c("tbl_df", "tbl", "data.frame"))
df |>
pivot_longer(cols = Case:Control) |>
filter(!is.na(value)) |>
pivot_wider(names_from = name, values_from = value)
#> # A tibble: 2 × 3
#> Gene.names Case Control
#> <chr> <dbl> <dbl>
#> 1 A1BG 52 32
#> 2 A2M 16 15
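Since the question asked for dplyr specifically, a dplyr-only variant might look like the sketch below. It assumes each gene has at most one non-missing value per column; if there were several, only the first would be kept. The name res is just an illustrative variable.

```r
library(dplyr)

df <- tibble(
  Gene.names = c("A1BG", "A1BG", "A2M", "A2M"),
  Case = c(52, NA, 16, NA),
  Control = c(NA, 32, NA, 15)
)

# For each gene, keep the first non-missing value in each column
# (yields NA if a column has no non-missing value for that gene)
res <- df %>%
  group_by(Gene.names) %>%
  summarise(across(c(Case, Control), ~ .x[!is.na(.x)][1]))
res
```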

Retaining all columns in `tidyr::pivot_wider()` output

I am trying to convert data from long format to wide format using tidyr::pivot_wider() but am running into problems.
Data
Let's say this is my example dataset
library(dplyr)
library(tidyr)
(dataEx <- structure(
list(
random1 = c(10, 10, 10, 10, 10, 10),
random2 = c(1, 1, 2, 2, 3, 3),
.rowid = c(1L, 1L, 2L, 2L, 3L, 3L),
Variable = c("x", "y", "x", "y", "x", "y"),
Dimension = c("Time", "Fraction", "Time", "Fraction", "Time", "Fraction"),
Unit = c("s", "%", "s", "%", "s", "%"),
Values = c(900, 25, 1800, 45, 3600, 78)
),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")
))
#> # A tibble: 6 x 7
#> random1 random2 .rowid Variable Dimension Unit Values
#> <dbl> <dbl> <int> <chr> <chr> <chr> <dbl>
#> 1 10 1 1 x Time s 900
#> 2 10 1 1 y Fraction % 25
#> 3 10 2 2 x Time s 1800
#> 4 10 2 2 y Fraction % 45
#> 5 10 3 3 x Time s 3600
#> 6 10 3 3 y Fraction % 78
Actual output
Here is what I currently use to pivot it to wide format. Although it works, note that it drops two columns: random1 and random2.
dataEx %>%
tidyr::pivot_wider(
id_cols = .rowid,
names_from = Variable,
values_from = dplyr::matches("Values|Unit|Dimension"),
names_glue = "{Variable}{.value}"
)
#> # A tibble: 3 x 7
#> .rowid xDimension yDimension xUnit yUnit xValues yValues
#> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 1 Time Fraction s % 900 25
#> 2 2 Time Fraction s % 1800 45
#> 3 3 Time Fraction s % 3600 78
Expected output
How can I prevent this, so that I get the following (expected) output?
#> # A tibble: 3 x 9
#> .rowid xDimension yDimension xUnit yUnit xValues yValues random1 random2
#> <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Time Fraction s % 900 25 10 1
#> 2 2 Time Fraction s % 1800 45 10 2
#> 3 3 Time Fraction s % 3600 78 10 3
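Add the columns you want to keep to the id_cols argument. Once id_cols is given, any column not listed in id_cols, names_from, or values_from is dropped: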
id_cols = c(.rowid, random1, random2)
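Put together with the call from the question, the fix could look like the following sketch. The data is rebuilt with tibble() so the block is self-contained, and wide is a hypothetical name for the result.

```r
library(dplyr)
library(tidyr)

dataEx <- tibble(
  random1 = c(10, 10, 10, 10, 10, 10),
  random2 = c(1, 1, 2, 2, 3, 3),
  .rowid = c(1L, 1L, 2L, 2L, 3L, 3L),
  Variable = c("x", "y", "x", "y", "x", "y"),
  Dimension = c("Time", "Fraction", "Time", "Fraction", "Time", "Fraction"),
  Unit = c("s", "%", "s", "%", "s", "%"),
  Values = c(900, 25, 1800, 45, 3600, 78)
)

# Listing random1 and random2 as id columns keeps them in the output
wide <- dataEx %>%
  pivot_wider(
    id_cols = c(.rowid, random1, random2),
    names_from = Variable,
    values_from = matches("Values|Unit|Dimension"),
    names_glue = "{Variable}{.value}"
  )
wide
```

The id columns come before the pivoted columns in the result, so a final relocate() or select() would be needed if the exact column order of the expected output matters.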

Organize columns with numerical info in colnames via dplyr relocate

I have a large amount of annual data in a data frame that will only get larger. I would like to organize it, grouping columns according to the year, which is included in the column names.
Base data:
dput(dat)
structure(list(id = 1:2, quantity = 3:4, avg_2002 = 5:6, avg_2003 = 7:8,
avg_2020 = 9:10, rev_2002 = c(15L, 24L), rev_2003 = c(21L,
32L), rev_2020 = c(27L, 40L)), row.names = c(NA, -2L), class = "data.frame")
What I would like to do is have all of the columns with, say, "2002" in them grouped together, followed by the "2003" columns, and so on. I know that relocate in dplyr is a good way to do this, so I did the following:
dat <- tibble(dat)
dat <- dat %>%
relocate(grep("2002$", colnames(dat), value = TRUE),
.before = grep("2003$", colnames(dat), value = TRUE)) %>%
relocate(grep("2003$", colnames(dat), value = TRUE),
.after = grep("2002$", colnames(dat), value = TRUE))
which produces the desired result for my toy dataset:
id quantity avg_2002 rev_2002 avg_2003 rev_2003 avg_2020 rev_2020
<int> <int> <int> <int> <int> <int> <int> <int>
1 1 3 5 15 7 21 9 27
2 2 4 6 24 8 32 10 40
My question is this:
How do I generalize the code above so that I don't have to keep adding relocate statements ad nauseam?
Is there a better way to do this task without using dplyr::relocate?
Any suggestions are much appreciated. Thanks!
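We may use select: extract the numeric part of the column names, order them, and use the resulting index in select to reorder the columns. The + 2 shifts the index past the first two columns (id and quantity), which are excluded from the ordering.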
library(dplyr)
dat %>%
select(id, quantity, order(readr::parse_number(names(.)[-(1:2)])) + 2)
-output
# A tibble: 2 × 8
id quantity avg_2002 rev_2002 avg_2003 rev_2003 avg_2020 rev_2020
<int> <int> <int> <int> <int> <int> <int> <int>
1 1 3 5 15 7 21 9 27
2 2 4 6 24 8 32 10 40
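If the non-year columns are not guaranteed to be the first two, a variant that finds the year columns by name may be more robust. The sketch below assumes that every year column ends in an underscore followed by a four-digit year; year_cols, other_cols, ordered_cols and res are illustrative names, not part of the answer above.

```r
library(dplyr)

dat <- data.frame(
  id = 1:2, quantity = 3:4,
  avg_2002 = 5:6, avg_2003 = 7:8, avg_2020 = 9:10,
  rev_2002 = c(15L, 24L), rev_2003 = c(21L, 32L), rev_2020 = c(27L, 40L)
)

# Columns whose names end in an underscore plus a four-digit year
year_cols <- grep("_[0-9]{4}$", names(dat), value = TRUE)
# Every other column stays at the front in its original order
other_cols <- setdiff(names(dat), year_cols)
# Sort the year columns by year; ties keep their original order
years <- as.integer(sub(".*_", "", year_cols))
ordered_cols <- year_cols[order(years)]

res <- dat %>% select(all_of(c(other_cols, ordered_cols)))
names(res)
```

New year columns added later would be picked up by the pattern without any further changes to the code.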

Left_join fill NA entries with data values from the second dataframe

I have two fairly complicated data.frames and managed to simplify the first step of my problem here. I have a reference table and another that contains my data as follows:
REFERENCE
ref <- structure(list(group = c("A", "B", "C"), position = c("a", "a",
"b")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
DATA
df <- structure(list(position = c("a", "a"), value = c(1, 1), name = c("foo",
"bar")), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
I used left_join(ref,df,by="position") %>% arrange(name) to obtain:
1 A a 1 foo
2 A a 1 bar
3 B a 1 foo
4 B a 1 bar
5 C b NA NA
The ideal output however is:
group position value name
<chr> <chr> <dbl> <chr>
1 A a 1 bar
2 B a 1 bar
3 C b 0 bar
4 A a 1 foo
5 B a 1 foo
6 C b 0 foo
I would like the NAs in the name column to be replaced with the names from df, and the NAs in the value column to be replaced with 0. In the real df, I have more than foo in the name column.
We could use crossing to get all combinations of the rows, then replace the 'value' entries with 0 wherever the two 'position' columns are not equal.
library(dplyr)
library(tidyr)
crossing(ref, df %>%
rename(position2 = position)) %>%
arrange(name) %>%
mutate(value = replace(value, position != position2 , 0)) %>%
select(-position2)
# A tibble: 6 x 4
# group position value name
# <chr> <chr> <dbl> <chr>
#1 A a 1 bar
#2 B a 1 bar
#3 C b 0 bar
#4 A a 1 foo
#5 B a 1 foo
#6 C b 0 foo

Reorder, exclude a column and keep others in R?

Here is my toy dataframe:
df <- structure(list(a = c(1, 2), b = c(3, 4), c = c(5, 6), d = c(7,
8)), .Names = c("a", "b", "c", "d"), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Now I want to reorder and exclude one of the columns and keep the others:
df %>% select(-a, d, everything())
I want my df to be :
d b c
7 3 5
8 4 6
I get the following:
b c d a
<dbl> <dbl> <dbl> <dbl>
1 3 5 7 1
2 4 6 8 2
Put -a last in the select. Even though a is removed at the beginning, the everything() at the end still selects from the column names of the whole dataset, so a is added back.
df %>%
select(d, everything(), -a)
# A tibble: 2 x 3
# d b c
# <dbl> <dbl> <dbl>
#1 7 3 5
#2 8 4 6
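An alternative that avoids thinking about argument order is to split the task into two steps. This is a minimal sketch using the toy data from the question, rebuilt with tibble(); res is an illustrative name.

```r
library(dplyr)

df <- tibble(a = c(1, 2), b = c(3, 4), c = c(5, 6), d = c(7, 8))

# Drop a first, then move d to the front; relocate() puts the
# selected columns first by default and leaves the rest in order
res <- df %>%
  select(-a) %>%
  relocate(d)
names(res)
```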
