R: Merge Data While Retaining Values for One Dataset in Duplicates - r

I have two data sets, data1 and data2:
data1 <- data.frame(ID = 1:6,
A = c("a1", "a2", NA, "a4", "a5", NA),
B = c("b1", "b2", "b3", NA, "b5", NA),
stringsAsFactors = FALSE)
data1
ID A B
1 a1 b1
2 a2 b2
3 NA b3
4 a4 NA
5 a5 b5
6 NA NA
and
data2 <- data.frame(ID = 1:6,
A = c(NA, "a2", "a3", NA, "a5", "a6"),
B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
stringsAsFactors = FALSE)
data2
ID A B
1 NA NA
2 a2 b2.wrong
3 a3 NA
4 NA b4
5 a5 b5
6 a6 b6
I would like to merge them by ID so that the resulting dataset, data.merged, is populated from both datasets but takes the value from data1 whenever both datasets have a value for the same field.
That is, I would like the final dataset, data.merged, to be:
ID A B
1 a1 b1
2 a2 b2
3 a3 b3
4 a4 b4
5 a5 b5
6 a6 b6
I have looked around, finding similar but not exact answers.

You can join the data and use coalesce to select the first non-NA value.
library(dplyr)
data1 %>%
inner_join(data2, by = 'ID') %>%
mutate(A = coalesce(A.x, A.y),
B = coalesce(B.x, B.y)) %>%
select(names(data1))
# ID A B
#1 1 a1 b1
#2 2 a2 b2
#3 3 a3 b3
#4 4 a4 b4
#5 5 a5 b5
#6 6 a6 b6
Or in base R, falling back to the data2 value wherever the data1 value is NA:
transform(merge(data1, data2, by = 'ID'),
A = ifelse(is.na(A.x), A.y, A.x),
B = ifelse(is.na(B.x), B.y, B.x))[names(data1)]
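If there are many shared columns, writing one coalesce/ifelse line per column gets tedious. Below is a minimal base R sketch that loops over every non-ID column instead. It assumes both data frames have the same column names and that merge's default ".x"/".y" suffixes are used; the helper name value_cols is only for illustration.

```r
data1 <- data.frame(ID = 1:6,
                    A = c("a1", "a2", NA, "a4", "a5", NA),
                    B = c("b1", "b2", "b3", NA, "b5", NA),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ID = 1:6,
                    A = c(NA, "a2", "a3", NA, "a5", "a6"),
                    B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
                    stringsAsFactors = FALSE)

# Shared columns get the suffixes ".x" (data1) and ".y" (data2)
merged <- merge(data1, data2, by = "ID", suffixes = c(".x", ".y"))

# Every column except the key is treated as a value column (assumption)
value_cols <- setdiff(names(data1), "ID")
for (col in value_cols) {
  x <- merged[[paste0(col, ".x")]]
  y <- merged[[paste0(col, ".y")]]
  # Keep the data1 value unless it is NA, then fall back to data2
  merged[[col]] <- ifelse(is.na(x), y, x)
}

data.merged <- merged[names(data1)]
data.merged
```

Note that merge() does an inner join by default, so IDs present in only one dataset are dropped unless you pass all = TRUE.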

Related

in R4.1.2 How to remove duplicate cells in a row leaving only the first cell

How to remove a repeated duplicate cell in row, leaving only the first cell.
(Remove the 2nd A3)
V1 V2 V3
A1 NA C1
A2 NA C2
A3 A3 C3
A4 NA C4
A5 NA C5
A6 NA C6
A7 NA C7
A8 NA C8
my target
V1 V2 V3
A1 NA C1
A2 NA C2
A3 NA C3
A4 NA C4
A5 NA C5
A6 NA C6
A7 NA C7
A8 NA C8
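# iterate over every row; for (x in nrow(dataset)) would visit only the last row
for (x in seq_len(nrow(dataset)))
{
  if (dataset[x, 2] %in% dataset[, 1])
    dataset[x, 2] <- NA
}
Something like this, I guess.
It should work even if the A3 is not in the same row as the A3 in the first column.
If the goal is to remove a value only when it matches the first column in the same row, replace the if statement with the following (the is.na check is needed because comparing with NA would make if fail):
if (!is.na(dataset[x, 2]) && dataset[x, 2] == dataset[x, 1])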
A possible solution:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
V1 = c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"),
V2 = c(NA, NA, "A3", NA, NA, NA, NA, NA),
V3 = c("C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8")
)
df %>%
mutate(V2 = if_else(V1 == V2, NA_character_, V2))
#> V1 V2 V3
#> 1 A1 <NA> C1
#> 2 A2 <NA> C2
#> 3 A3 <NA> C3
#> 4 A4 <NA> C4
#> 5 A5 <NA> C5
#> 6 A6 <NA> C6
#> 7 A7 <NA> C7
#> 8 A8 <NA> C8
Using replace.
transform(df, V2=replace(V2, V2 %in% V1, NA))
# V1 V2 V3
# 1 A1 <NA> C1
# 2 A2 <NA> C2
# 3 A3 <NA> C3
# 4 A4 <NA> C4
# 5 A5 <NA> C5
# 6 A6 <NA> C6
# 7 A7 <NA> C7
# 8 A8 <NA> C8
Or %in% in Reduce.
df[Reduce(`%in%`, df[1:2]), 'V2'] <- NA
df
# V1 V2 V3
# 1 A1 <NA> C1
# 2 A2 <NA> C2
# 3 A3 <NA> C3
# 4 A4 <NA> C4
# 5 A5 <NA> C5
# 6 A6 <NA> C6
# 7 A7 <NA> C7
# 8 A8 <NA> C8
Data:
df <- structure(list(V1 = c("A1", "A2", "A3", "A4", "A5", "A6", "A7",
"A8"), V2 = c(NA, NA, "A3", NA, NA, NA, NA, NA), V3 = c("C1",
"C2", "C3", "C4", "C5", "C6", "C7", "C8")), row.names = c(NA,
-8L), class = "data.frame")
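The answers above differ in one detail that the sample data does not expose: %in% blanks a V2 value if it appears anywhere in V1, while == only blanks it when it matches V1 in the same row. A small base R sketch of the difference, using a made-up three-row data frame (not the question's data) where A3 also shows up in a non-matching row:

```r
# Hypothetical data: "A3" appears in V2 on row 1 (no match in that row)
# and on row 3 (matches V1 in the same row)
df <- data.frame(V1 = c("A1", "A2", "A3"),
                 V2 = c("A3", NA, "A3"),
                 V3 = c("C1", "C2", "C3"),
                 stringsAsFactors = FALSE)

# Blank V2 when its value occurs anywhere in V1
any_row <- replace(df$V2, df$V2 %in% df$V1, NA)

# Blank V2 only when it equals V1 in the same row;
# the is.na() guard keeps the index free of NA
same_row <- replace(df$V2, !is.na(df$V2) & df$V2 == df$V1, NA)

any_row
same_row
```

Which of the two you want depends on whether a repeat in a different row counts as a duplicate.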

How to rank order dates in R [duplicate]

I have a dataframe with multiple products and different date ranges. I want to assign a unique value to each date so that, even if the starting dates differ between products, I can group by date.
df
acc product date
a1 p1 d1
a1 p1 d2
a1 p1 d3
a1 p1 d4
a1 p2 d1
a1 p2 d2
a1 p2 d3
a1 p3 d3
a1 p3 d4
I want to arrange the dates so that there is a unique identifier each for d1, d2, d3 etc.
I used the following code to try this:
df <- df %>% group_by(acc, product) %>% mutate(t = row_number())
Output
df
acc product date t EXPECTED
a1 p1 d1 1 1
a1 p1 d2 2 2
a1 p1 d3 3 3
a1 p1 d4 4 4
a1 p2 d1 1 1
a1 p2 d2 2 2
a1 p2 d3 3 3
a1 p3 d3 1 3
a1 p3 d4 2 4
Any suggestions for this?
Use dplyr::dense_rank(), which ranks the distinct dates without gaps:
df %>% mutate(new = dense_rank(date))
acc product date new
1 a1 p1 d1 1
2 a1 p1 d2 2
3 a1 p1 d3 3
4 a1 p1 d4 4
5 a1 p2 d1 1
6 a1 p2 d2 2
7 a1 p2 d3 3
8 a1 p3 d3 3
9 a1 p3 d4 4
If, however, you want to restart the ranks for each acc, add group_by(acc) before the mutate statement.
dput used
df <- structure(list(acc = c("a1", "a1", "a1", "a1", "a1", "a1", "a1",
"a1", "a1"), product = c("p1", "p1", "p1", "p1", "p2", "p2",
"p2", "p3", "p3"), date = c("d1", "d2", "d3", "d4", "d1", "d2",
"d3", "d3", "d4")), class = "data.frame", row.names = c(NA, -9L
))
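For anyone who prefers not to load dplyr, here is a rough base R sketch of the same idea: a dense rank is just the position of each value among the sorted unique values. The two extra rows for account "a2" are made up to show the per-account restart, and the column names new and new_by_acc are only illustrative.

```r
df <- data.frame(
  acc     = c(rep("a1", 9), "a2", "a2"),   # "a2" rows are hypothetical
  product = c("p1", "p1", "p1", "p1", "p2", "p2", "p2", "p3", "p3", "p1", "p1"),
  date    = c("d1", "d2", "d3", "d4", "d1", "d2", "d3", "d3", "d4", "d3", "d4"),
  stringsAsFactors = FALSE
)

# Dense rank over the whole column: position among the sorted unique dates
df$new <- match(df$date, sort(unique(df$date)))

# Same ranking, restarted within each acc; ave() is given row indices so the
# result stays integer
df$new_by_acc <- ave(seq_along(df$date), df$acc, FUN = function(i) {
  match(df$date[i], sort(unique(df$date[i])))
})

df
```

This sorts the dates as strings here; with real data, converting the column to Date first would make the ranking chronological.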

how to create new variables from one variable using two rules

I would appreciate any help to create new variables from one variable.
Specifically, I need help to simultaneously create one row per ID and several columns of E, where each new column (that is, E1, E2, E3) contains the values of E for that ID. I tried doing this with melt followed by spread, but I am getting the error:
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
Additionally, I tried the solutions discussed here and here but these did not work for my case because I need to be able to create row identifiers for rows (4, 1, 2), (7, 3, 5), and (9, 6, 8). That is, E for rows (4, 1, 2) should be named E1, E for rows (7, 3, 5) should be named E2, E for rows (9, 6, 8) should be named E3, and so on.
#data
dT<-structure(list(A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1",
"a2", "a1"), B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1",
"b2", "b1"), ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"
), E = c(0.621142094943352, 0.742109450696123, 0.39439152996948,
0.40694392882818, 0.779607277916503, 0.550579323666347, 0.352622183880119,
0.690660491345867, 0.23378944873769)), class = c("data.table",
"data.frame"), row.names = c(NA, -9L))
#my attempt
A B ID E
1: a1 b2 3 0.6211421
2: a2 b2 4 0.7421095
3: a1 b2 3 0.3943915
4: a1 b1 1 0.4069439
5: a2 b2 4 0.7796073
6: a1 b2 3 0.5505793
7: a1 b1 1 0.3526222
8: a2 b2 4 0.6906605
9: a1 b1 1 0.2337894
aTempDF <- melt(dT, id.vars = c("A", "B", "ID"))
A B ID variable value
1: a1 b2 3 E 0.6211421
2: a2 b2 4 E 0.7421095
3: a1 b2 3 E 0.3943915
4: a1 b1 1 E 0.4069439
5: a2 b2 4 E 0.7796073
6: a1 b2 3 E 0.5505793
7: a1 b1 1 E 0.3526222
8: a2 b2 4 E 0.6906605
9: a1 b1 1 E 0.2337894
aTempDF%>%spread(variable, value)
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
#expected output
A B ID E1 E2 E3
1: a1 b2 3 0.6211421 0.3943915 0.5505793
2: a2 b2 4 0.7421095 0.7796073 0.6906605
3: a1 b1 1 0.4069439 0.3526222 0.2337894
Thanks in advance for any help.
You can use dcast from data.table
library(data.table)
dcast(dT, A + B + ID ~ paste0("E", rowid(ID)), value.var = "E")
# A B ID E1 E2 E3
#1 a1 b1 1 0.4069439 0.3526222 0.2337894
#2 a1 b2 3 0.6211421 0.3943915 0.5505793
#3 a2 b2 4 0.7421095 0.7796073 0.6906605
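You need to create the correct 'time variable' first, which is what rowid(ID) does: it numbers the rows within each ID (1, 2, 3, ...), so every combination of identifier and time value is unique.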
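To see what that within-ID counter looks like without data.table, here is a minimal base R sketch. It uses a plain data.frame with the E values rounded to two decimals for readability (an assumption, not the original precision), builds the counter with ave(), and widens with stats::reshape().

```r
dT <- data.frame(
  A  = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
  B  = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
  ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
  E  = c(0.62, 0.74, 0.39, 0.41, 0.78, 0.55, 0.35, 0.69, 0.23),  # rounded
  stringsAsFactors = FALSE
)

# Running count of rows within each ID, the base R analogue of rowid(ID)
dT$rn <- ave(seq_along(dT$ID), dT$ID, FUN = seq_along)
dT$rn

# One row per A/B/ID, one E column per counter value (E1, E2, E3)
wide <- reshape(dT, idvar = c("A", "B", "ID"), timevar = "rn",
                v.names = "E", direction = "wide", sep = "")
wide
```

Without the counter there is nothing to tell the three E values of one ID apart, which is what the "Duplicate identifiers" error is complaining about.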
For those looking for a tidyverse solution:
library(tidyverse)
dT <- structure(
list(
A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
E = c(0.621142094943352, 0.742109450696123, 0.39439152996948, 0.40694392882818,
0.779607277916503, 0.550579323666347, 0.352622183880119, 0.690660491345867,
0.23378944873769)),
class = c("data.table",
"data.frame"),
row.names = c(NA, -9L))
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# Just so columns are "E1", "E2", etc.
mutate(rn = glue::glue("E{row_number()}")) %>%
ungroup() %>%
spread(rn, E) %>%
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234
As mentioned in the accepted answer, you need a "key" variable to spread on first. This is created using row_number() and glue where glue just gives you the proper E1, E2, etc. variable names.
The group_by piece just makes sure that the row numbers are with respect to A, B and ID.
EDIT for tidyr >= 1.0.0
The (not-so) new pivot_ functions supersede gather and spread and eliminate the need to glue the new variable names together in a mutate.
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# no longer need to glue (or paste) the names together but still need a row number
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = rn, values_from = E, names_glue = "E{.name}") %>% # names_glue argument allows for easy transforming of the new variable names
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234

Merging two data.tables with common ID but different Columns [duplicate]

I try to combine two data.tables in R based on a common ID but varying columns and I also want to drop duplicate ID rows. My approach would be:
dt1 dt2
ID X1 Y1 Z1 ID X2 Y2 Z2
1 a1 a2 a3 1 A1 A2 A3
2 b1 b2 b3 2 B1 NA B3
3 c1 c2 NA 3 C1 C2 C3
4 d1 d2 d3 5 E1 E2 E3
6 f1 f2 f3 6 F1 F2 F3
Using rbind(dt1, dt2, fill = TRUE) gives me:
dt_merged
ID X1 Y1 Z1 X2 Y2 Z2
1 a1 a2 a3 NA NA NA
1 NA NA NA A1 A2 A3
2 b1 b2 b3 NA NA NA
2 NA NA NA B1 NA B3
3 c1 c2 NA NA NA NA
3 NA NA NA C1 C2 C3
4 d1 d2 d3 NA NA NA
5 NA NA NA E1 E2 E3
6 f1 f2 f3 NA NA NA
6 NA NA NA F1 F2 F3
My problem now is that I don't know how to merge the duplicate ID rows and fill in the NAs with the corresponding data from the duplicate ID rows. My desired output data.table would be:
ID X1 Y1 Z1 X2 Y2 Z2
1 a1 a2 a3 A1 A2 A3
2 b1 b2 b3 B1 NA B3
3 c1 c2 NA C1 C2 C3
4 d1 d2 d3 NA NA NA
5 NA NA NA E1 E2 E3
6 f1 f2 f3 F1 F2 F3
I hope my description is good enough to give you an overview of my problem. Any kind of help would be highly appreciated, and excuse me for my foolish question, but data.table wrangling sometimes gives me a very hard time.
Simply do a full join. It is very simple with the dplyr package (full_join), or with merge(..., all = TRUE), which works on both data.frames and data.tables.
library(dplyr)
dt1 <- data.frame("ID" = c(1,2,3,4,6),
"X1" = c("a1", "b1", "c1", "d1", "f1"),
"Y1" = c("a2", "b2", "c2", "d2", "f2"),
"Z1" = c("a3", "b3", NA, "d3", "f3")
)
dt2 <- data.frame("ID" = c(1,2,3,5,6),
"X2" = c("A1", "B1", "C1", "E1", "F1"),
"Y2" = c("A2", NA, "C2", "E2", "F2"),
"Z2" = c("A3", "B3", "C3", "E3", "F3")
)
dt3 <- full_join(x = dt1, y = dt2, by = "ID") %>%
arrange(ID)
dt4 <- merge(dt1, dt2, by = "ID", all = TRUE)
dt3
dt4
Updated:
If you ever need to join more tables (as per OP's comment), just chain them:
dt5 <- data.frame("ID" = c(1,3,4,5,7),
"X3" = c("A1", "C1", "D1", "E1","G1"),
"Y3" = c(NA, "C2", "D2", "E2", "G2"),
"Z3" = c("A3","C3", "D3", "E3", NA)
)
dt6 <- full_join(x = dt1, y = dt2, by = "ID") %>%
full_join( x = ., y = dt5, by = "ID") %>%
arrange(ID)
dt6
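If the tables already sit in a list, the chaining can be written once with Reduce instead of one join per table. A minimal base R sketch, reusing the three tables from this answer (the list name tables is only illustrative):

```r
dt1 <- data.frame(ID = c(1, 2, 3, 4, 6),
                  X1 = c("a1", "b1", "c1", "d1", "f1"),
                  Y1 = c("a2", "b2", "c2", "d2", "f2"),
                  Z1 = c("a3", "b3", NA, "d3", "f3"))
dt2 <- data.frame(ID = c(1, 2, 3, 5, 6),
                  X2 = c("A1", "B1", "C1", "E1", "F1"),
                  Y2 = c("A2", NA, "C2", "E2", "F2"),
                  Z2 = c("A3", "B3", "C3", "E3", "F3"))
dt5 <- data.frame(ID = c(1, 3, 4, 5, 7),
                  X3 = c("A1", "C1", "D1", "E1", "G1"),
                  Y3 = c(NA, "C2", "D2", "E2", "G2"),
                  Z3 = c("A3", "C3", "D3", "E3", NA))

tables <- list(dt1, dt2, dt5)

# Full outer join of each table onto the accumulated result, keyed on ID
merged <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), tables)
merged
```

This only works as-is because the non-key column names are all different; with overlapping names, merge would add .x/.y suffixes.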

Order by letters and numbers

I have a DF$vector which looks like this:
A10 A50
C1 C4
B1
A7
C3
B1 B4
I look for a way to order it as follows:
A10 A50
A7
B1 B4
B1
C1 C4
C3
I tried to use gsub :
vector[order(gsub("([A-Z]+)([0-9]+)", "\\1", vector),
as.numeric(gsub("([A-Z]+)([0-9]+)", "\\2", vector)))]
But it didn't return what I want.
Thank you for any suggestions.
We can use order from base R
df1[order(sub("\\d+", "", df1[,1]), as.numeric(sub("\\D+", "", df1[,1])), df1[,2] == ""),]
# A10 A50
#3 A7
#5 B1 B4
#2 B1
#1 C1 C4
#4 C3
data
df1 <-structure(list(A10 = c("C1", "B1", "A7", "C3", "B1"), A50 = c("C4",
"", "", "", "B4")), .Names = c("A10", "A50"), class = "data.frame",
row.names = c(NA, -5L))
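Character strings are compared lexicographically, so values starting with A sort before values starting with B, and so on. For this data, where every numeric part is a single digit, ranking the first column is therefore enough (with multi-digit numbers it would not be, since "A10" sorts before "A7"). ties.method = "last" puts the later of the two tied B1 rows first: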
df1$r=rank(df1$A10,ties.method = "last")
df1[order(df1$r),-ncol(df1)]
A10 A50
3 A7
5 B1 B4
2 B1
1 C1 C4
4 C3
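Here is a small base R sketch of why plain string ordering breaks down once the numbers have more than one digit, and how ordering by letter prefix and then by numeric value avoids it. The vector x is made up for illustration, and the sketch assumes every value is letters followed by digits and that numeric order within a letter is what you want.

```r
x <- c("A10", "C1", "B1", "A7", "C3")

# Plain string sort compares character by character, so "A10" lands before "A7"
sort(x, method = "radix")

# Split each value into its letter prefix and its numeric part
prefix <- sub("[0-9]+$", "", x)
number <- as.numeric(sub("^[A-Za-z]+", "", x))

# Order by letters first, then by the number as a number
natural <- x[order(prefix, number)]
natural
```

The same two keys can be passed to order() on a data frame column, as in the first answer.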
