Merging data frames of different lengths without unique keys in R

I am trying to merge two data frames of different lengths without using a unique key.
For example:
Name <- c("Steve","Peter")
Age <- c(10,20)
df1 <- data.frame(Name,Age)
> df1
Name Age
1 Steve 10
2 Peter 20
Name <-c("Jason","Nelson")
School <-c("xyz","abc")
df2 <- data.frame(Name,School)
> df2
Name School
1 Jason xyz
2 Nelson abc
I want to combine these two tables so that the result has all columns, with NA in the cells where a row's original data frame didn't have that column. It should look something like this:
Name Age School
1 Steve 10 <NA>
2 Peter 20 <NA>
3 Jason NA xyz
4 Nelson NA abc
Thank you in advance!

dplyr::bind_rows(df1,df2)
# Warning in bind_rows_(x, .id) :
# Unequal factor levels: coercing to character
# Warning in bind_rows_(x, .id) :
# binding character and factor vector, coercing into character vector
# Warning in bind_rows_(x, .id) :
# binding character and factor vector, coercing into character vector
# Name Age School
# 1 Steve 10 <NA>
# 2 Peter 20 <NA>
# 3 Jason NA xyz
# 4 Nelson NA abc
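The factor warnings come from the pre-R 4.0 default of stringsAsFactors = TRUE in data.frame(). As a minimal sketch (rebuilding the same frames with character columns), they go away:
df1 <- data.frame(Name = c("Steve", "Peter"), Age = c(10, 20),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Jason", "Nelson"), School = c("xyz", "abc"),
                  stringsAsFactors = FALSE)
dplyr::bind_rows(df1, df2)  # same result, no coercion warnings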
You can also avoid the warnings by pre-assigning the missing columns to each frame, an approach that additionally makes a base-R rbind() work:
df2 <- cbind(df2, df1[NA,setdiff(names(df1), names(df2)),drop=FALSE])
df1 <- cbind(df1, df2[NA,setdiff(names(df2), names(df1)),drop=FALSE])
df1
# Name Age School
# NA Steve 10 <NA>
# NA.1 Peter 20 <NA>
df2
# Name School Age
# NA Jason xyz NA
# NA.1 Nelson abc NA
# ensure we use the same column order for both frames
nms <- names(df1)
rbind(df1[,nms], df2[,nms])
# Name Age School
# NA Steve 10 <NA>
# NA.1 Peter 20 <NA>
# NA1 Jason NA xyz
# NA.11 Nelson NA abc
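Since Name happens to be the only column the two frames share, and the two sets of names don't overlap, a base-R outer merge gives the same shape in one call. A minimal sketch, assuming the df1 and df2 above with character Name columns (the R >= 4.0 default):
merge(df1, df2, by = "Name", all = TRUE)
# rows are sorted by Name by default; pass sort = FALSE to skip the sorting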

Related

data.table::dcast long to wide data while ignoring NA-Category?

I want to transform my data from long to wide after some joins, which leaves a few NAs in the data.
Unfortunately, these NAs also show up on the right-hand side (RHS) of the formula, which defines the newly added columns in the transformation.
Consider this example:
library(data.table)
dt <- data.table(id=c(1,2,1,2,3,4),
group = c("A","A","B","B",NA,NA),
values = c(7,8,9,10,NA,NA))
dt_wide <- dcast(dt,
id ~ group,
value.var = c("values"))
In the data, rows 5 and 6 do not have any group or associated value:
id group values
1: 1 A 7
2: 2 A 8
3: 1 B 9
4: 2 B 10
5: 3 <NA> NA
6: 4 <NA> NA
If there is an associated value, a group must exist; therefore (group == NA) implies (value == NA).
The transformed data table wrongly treats NA as its own group in the group column, which results in the following wide data table:
id NA A B
1: 1 NA 7 9
2: 2 NA 8 10
3: 3 NA NA NA
4: 4 NA NA NA
I would prefer not to build a possibly buggy workaround that retroactively deletes the NA column by name or values (the code will have to handle different column names and columns later in production).
Is there a way to tell dcast to ignore the NAs in group and not make an extra column out of it, while preserving all rows in the transformed table?
Like this:
id A B
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA
This is tricky, but seems to work:
dcast(dt,
id ~ ifelse(is.na(group),unique(na.omit(dt$group)),group),
value.var = c("values"))
Key: <id>
id A B
<num> <num> <num>
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA
I don't think it's possible to prevent dcast from doing that. I'd just filter them out afterwards:
dt_wide[, names(dt_wide) != "NA", with = FALSE]
Output:
id A B
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA
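Another option not shown above (a sketch using only data.table calls): cast only the non-NA groups, then join the result back onto the full set of ids so that ids 3 and 4 survive with NA in every value column:
library(data.table)
wide <- dcast(dt[!is.na(group)], id ~ group, value.var = "values")
# right-join onto all unique ids so ids without any group are kept
wide[unique(dt[, .(id)]), on = "id"]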

Convert a single column into multiple columns based on delimiter in R

I have the following dataframe:
ID Parts
-- -----
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
And I would like to convert the Parts column into multiple columns by the : delimiter, so it should look like:
ID A B X2 J4 C D G4 X6 ........
-- - - -- -- - - -- --
1 A B na na na na na na
2 na na X2 na na na na na
3 na na na J4 na na na na
4 A na na na C D G4 X6
where I would not know the number of potential columns in advance.
I have met my match on this one: I can strsplit() by the delimiter, but only with a fixed number of entries in the Parts column.
You can use a combination of tidyr::separate, tidyr::pivot_wider, and tidyr::pivot_longer. The only thing you need to determine up front is how many columns to split Parts into, not the number of unique values, and str_count gives you that (see How it works below):
library(dplyr)
library(tidyr)
library(stringr)
n_col <- max(stringr::str_count(df$Parts, ":")) + 1
df %>%
tidyr::separate(Parts, into = paste0("col", 1:n_col), sep = ":") %>%
dplyr::mutate(across(everything(), ~dplyr::na_if(., ""))) %>%
tidyr::pivot_longer(-ID) %>%
dplyr::select(-name) %>%
tidyr::drop_na() %>%
tidyr::pivot_wider(id_cols = ID,
names_from = value)
ID A B X2 J4 C D G4 X6
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A B NA NA NA NA NA NA
2 2 NA NA X2 NA NA NA NA NA
3 3 NA NA NA J4 NA NA NA NA
4 4 A NA NA NA C D G4 X6
How it works
You do not need to know the number of unique values with this code; the pivots take care of that. What you do need to know is how many new columns Parts will be split into by separate. That's easy to get with str_count: count the delimiters and add one. This way you have the appropriate number of columns to separate Parts into by your delimiter.
pivot_longer (with the name column dropped) leaves a two-column data frame of repeated IDs and the delimited values of Parts, i.e. one ID/Parts pairing per row. pivot_wider then automatically creates a column for each unique value of Parts, retains the value within that column, and fills with NA wherever an ID/Parts combination is not found.
Try running this pipe by pipe to better understand it if need be.
Data
lines <- "
ID Parts
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
"
df <- read.table(text = lines, header = T)
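For comparison, here is a minimal base-R sketch of the same idea, assuming the df defined above: split each Parts string, collect the unique non-empty tokens as column names, and mark which tokens each row contains.
parts <- strsplit(df$Parts, ":", fixed = TRUE)
vals <- setdiff(unique(unlist(parts)), "")   # drop the empty tokens produced by "::"
wide <- t(sapply(parts, function(p) ifelse(vals %in% p, vals, NA)))
colnames(wide) <- vals
cbind(df["ID"], as.data.frame(wide))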
Could the separate function from tidyr be what you are looking for?
https://tidyr.tidyverse.org/reference/separate.html
It might require some fancy regex implementation, but could potentially work.
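If counting the delimiters up front is a concern, tidyr::separate_rows() sidesteps it. A sketch under the same data assumption as above:
library(dplyr)
library(tidyr)
df %>%
  separate_rows(Parts, sep = ":") %>%        # one row per token
  filter(Parts != "") %>%                    # drop empty tokens from "::"
  mutate(value = Parts) %>%
  pivot_wider(id_cols = ID, names_from = Parts, values_from = value)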

How do I combine two identical dataframes together with some replacement of values in one column?

I have a simple conceptual issue here and could use some help!
Here is my code:
Rank <- c(1,2,3,4,5)
ID <- c("Jack", "Tom", "Chloe", "Mary", "Max")
df <- data.frame(ID, Rank)
rankrange <- subset(df, Rank >2)
for (i in 1:nrow(rankrange)) {  # revaluing ranks 3, 4, 5 with NA
  rankrange[i, 2] <- NA
}
ID Rank
3 Chloe NA
4 Mary NA
5 Max NA
How do I combine df & rankrange such that the values for Chloe, Mary
and Max are replaced with NA in df?
I hope to get this result:
ID Rank
1 Jack 1
2 Tom 2
3 Chloe NA
4 Mary NA
5 Max NA
I think this can be achieved with a function but I do not know which! Thanks!
Edit
I'm trying to understand what happens in the event I have two identical dataframes yet there's a variation in values for one column. What function can I use to combine these two dataframes together such that the values are replaced?
What exactly you are hoping for is still a bit unclear, but here is an example that fits your data:
rbind(
df[!df$ID %in% rankrange$ID, ],
rankrange
)
ID Rank
1 Jack 1
2 Tom 2
3 Chloe NA
4 Mary NA
5 Max NA
Is this what you are looking for?
df$Rank <- ifelse(df$Rank > 2, NA, df$Rank)
ID Rank
1 Jack 1
2 Tom 2
3 Chloe NA
4 Mary NA
5 Max NA
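If the two frames really are identical in structure and keyed by ID, dplyr (>= 1.0.0) also has rows_update(), which overwrites the matching rows of one frame with the values from the other. A sketch, rebuilding rankrange with real NA values:
library(dplyr)
rankrange <- subset(df, Rank > 2)
rankrange$Rank <- NA_real_            # blank out the ranks to be replaced
rows_update(df, rankrange, by = "ID")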

Complex join of longitudinal tables in R

I have ~16 .txt files that I need to turn into one, wide flat file. For each new file, time has passed, and some new variables are added. What I would like to do is append those new columns to the right side of the first table, joining by an identification variable. This gets complicated quickly, so here is an MRE:
library(dplyr)
id <- as.character(1:6)
first <- c("jeff", "jimmy", "andrew", "taj", "karl-anthony", "jamal")
last <- c("teague", "butler", "wiggins", "gibson", "towns", "crawford")
set.seed(1839)
a <- c(1:4, NA, NA)
b <- c(1:4, NA, NA)
c <- c(11:13, NA, 14, NA)
d <- c(11:13, NA, 14, NA)
e <- c(21, 22, NA, 24, NA, 26)
f <- c(21, 22, NA, 24, NA, 26)
Simulating the three different files:
df_1 <- data.frame(
id = id[c(1:3,5)],
first = first[c(1:3,5)],
last = last[c(1:3,5)],
a = a[c(1:3,5)],
b = b[c(1:3,5)]
)
df_2 <- data.frame(
id = id[c(1:3,5)],
first = first[c(1:3,5)],
last = last[c(1:3,5)],
c = c[c(1:3,5)],
d = d[c(1:3,5)]
)
df_3 <- data.frame(
id = id[c(1,2,4,6)],
first = first[c(1,2,4,6)],
last = last[c(1,2,4,6)],
e = e[c(1,2,4,6)],
f = f[c(1,2,4,6)]
)
df_goal <- data.frame(id, first, last, a, b, c, d, e, f)
df_goal is what I want, and here is what it looks like:
> df_goal
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 4 taj gibson 4 4 NA NA 24 24
5 5 karl-anthony towns NA NA 14 14 NA NA
6 6 jamal crawford NA NA NA NA 26 26
Note that these are very big files, and the columns are not always in the right order, so I cannot just say to join by keeping the first three columns.
If I do a full_join on all, I get the names repeated every time:
df_all <- df_1 %>%
full_join(df_2, by = "id") %>%
full_join(df_3, by = "id")
> df_all
id first.x last.x a b first.y last.y c d first last e f
1 1 jeff teague 1 1 jeff teague 11 11 jeff teague 21 21
2 2 jimmy butler 2 2 jimmy butler 12 12 jimmy butler 22 22
3 3 andrew wiggins 3 3 andrew wiggins 13 13 <NA> <NA> NA NA
4 5 karl-anthony towns NA NA karl-anthony towns 14 14 <NA> <NA> NA NA
5 4 <NA> <NA> NA NA <NA> <NA> NA NA taj gibson 24 24
6 6 <NA> <NA> NA NA <NA> <NA> NA NA jamal crawford 26 26
What I tried next: I wrote a for loop that took each data frame, selected (a) the id column and (b) the columns whose names had not yet appeared in the combined data frame, and (c) did a full_join:
dfs <- c("df_2", "df_3")
df_all1 <- df_1
for (i in dfs) {
df_all1 <- get(i)[!names(get(i)) %in% names(df_all1)[-1]] %>%
full_join(df_all1, .)
}
> df_all1
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 5 karl-anthony towns NA NA 14 14 NA NA
5 4 <NA> <NA> NA NA NA NA 24 24
6 6 <NA> <NA> NA NA NA NA 26 26
Note that this means the cases that did not appear in the first file are missing the names (these represent key demographic variables in my data). I also tried going row by row, doing a column join if the id was already present and a bind_rows() if it was not. This code threw an error:
df_all2 <- df_1
for (i in dfs) {
for (k in 1:nrow(get(i))) {
if (get(i)[k, "id"] %in% df_all2$id) {
df_all2 <- get(i)[k, !names(get(i)) %in% names(df_all2)[-1]] %>%
left_join(df_all2, ., by = "id")
} else {
df_all2 <- bind_rows(
df_all2,
get(i)[k, !names(get(i)) %in% names(df_all2)[-1]]
)
}
}
}
There has got to be a way to do a join on only selected columns while filling in missing information where necessary. Again, I am working with lots of files with lots of columns, so I cannot assume that I know the position of any columns; it has to be done by the column names.
I have also thought about just including a new variable that is the date of the file, stacking them all on top of one another ("long" format), and then using tidyr::spread and tidyr::gather, but I haven't found a solution yet.
I am not wedded to the tidyverse (base or data.table would be great, even some way to do a SQL join in R) or even R; I am open to a Python solution using pandas, as well.
Short version: How do I join new columns onto an existing data set by identification number, while also filling in the existing columns (like first and last) for cases that first appear in a later file?
Possible solution, per Psidom:
df_all1 <- df_1
for (i in dfs) {
df_all1 <- get(i) %>%
full_join(
df_all1, .,
by = names(get(i))[names(get(i)) %in% names(df_all1)]
)
}
df_all1
Maybe there is a more efficient way to do this, though?
Using data.table::melt once you have the full_join result df_all:
library(data.table)
df <- melt(setDT(df_all),
measure.vars = patterns("^first", "^last"))
df <- unique(df[,-c("id", "variable")])
df[!is.na(df$value1),]
a b c d e f value1 value2
1: 1 1 11 11 21 21 jeff teague
2: 2 2 12 12 22 22 jimmy butler
3: 3 3 13 13 NA NA andrew wiggins
4: NA NA 14 14 NA NA karl-anthony towns
5: NA NA NA NA 24 24 taj gibson
6: NA NA NA NA 26 26 jamal crawford
The simplest solution using dplyr is to omit the by parameter in the calls to full_join().
library(dplyr)
df_1 %>%
full_join(df_2) %>%
full_join(df_3)
Joining, by = c("id", "first", "last")
Joining, by = c("id", "first",
"last")
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 5 karl-anthony towns NA NA 14 14 NA NA
5 4 taj gibson NA NA NA NA 24 24
6 6 jamal crawford NA NA NA NA 26 26
Warning messages:
1: Column id joining factors with different levels, coercing to character vector
2: Column first joining factors with different levels, coercing to character vector
3: Column last joining factors with different levels, coercing to character vector
The documentation of the by parameter in ?full_join says: If NULL, the default, *_join() will do a natural join, using all variables with common names across the two tables.
So this is equivalent to explicitly passing by = c("id", "first", "last") as proposed by Psidom.
If there are many data frames to join, the code below may save a lot of typing:
Reduce(full_join, list(df_1, df_2, df_3))
The result (including messages) is the same as above.
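Since the question mentions being open to base R: the same natural join can be written with merge(), which also defaults to joining on all shared column names. A sketch, assuming character rather than factor id/first/last columns (the R >= 4.0 default):
Reduce(function(x, y) merge(x, y, all = TRUE), list(df_1, df_2, df_3))
# merge() sorts by the join columns, so reorder by id afterwards if needed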

Remove duplicates while handling NA values in R

My data set (df) looks like:
ID Name Rating Score Ranking
1 abc 3 NA NA
1 abc 3 12 13
2 bcd 4 NA NA
2 bcd 4 19 20
I'm trying to remove the duplicates using
df <- df[!duplicated(df[1:2]),]
which gives,
ID Name Rating Score Ranking
1 abc 3 NA NA
2 bcd 4 NA NA
but I'm trying to get,
ID Name Rating Score Ranking
1 abc 3 12 13
2 bcd 4 19 20
How do I avoid keeping the rows containing NAs while removing the duplicates? Some help would be great, thanks.
First, order the rows so that within each ID/Name pair the NAs come last (na.last = TRUE belongs inside order()):
df <- df[with(df, order(ID, Name, Score, Ranking, na.last = TRUE)), ]
then remove the duplicated ones with the fromLast = FALSE argument:
df <- df[!duplicated(df[1:2], fromLast = FALSE), ]
Using dplyr
df <- df %>% filter(!duplicated(.[,1:2], fromLast = T))
You could just filter out the observations you don't want with which() and is.na(), and then use the unique() function:
a <- unique(c(which(!is.na(df$Score)), which(!is.na(df$Ranking))))
df2 <- unique(df[a, ])
> df2
ID Name Rating Score Ranking
2 1 abc 3 12 13
4 2 bcd 4 19 20
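Another common pattern (a sketch, assuming you simply want the most complete row per ID/Name pair): sort so that rows without NAs come first, then keep the first row of each pair.
library(dplyr)
df %>%
  arrange(ID, Name, is.na(Score), is.na(Ranking)) %>%  # complete rows sort first
  distinct(ID, Name, .keep_all = TRUE)                 # keep first row per pair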
