Combine two data frames across multiple columns - r

Say I have two dataframes, each with four columns. One column is a numeric value. The other three are identifying variables. For example:
set1 <- data.frame(label1 = c("a","b", "c"), label2 = c("red", "white", "blue"), name = c("sam", "bob", "drew"), val = c(1, 10, 100))
set2 <- data.frame(label1 = c("b","c", "d"), label2 = c("white", "green", "orange"), name = c("bob", "drew", "collin"), val = c(7, 100, 15))
Which are:
> set1
label1 label2 name val
1 a red sam 1
2 b white bob 10
3 c blue drew 50
> set2
label1 label2 name val
1 b white bob 7
2 c green drew 100
3 d orange collin 15
The first three columns can be combined to form a primary key. What is the most efficient way to combine these two data frames such that all unique values (from columns label1, label2, name) are displayed along with the two val columns:
set3 <- data.frame(label = c("a", "b", "c", "c", "d"), label2 = c("red", "white", "blue", "green", "orange"), name = c("sam", "bob", "drew", "drew", "collin"), val.set1 = c(1, 10, 50, NA, NA), val.set2 = c(NA, 7, NA, 100, 15))
> set3
label label2 name val.set1 val.set2
1 a red sam 1 NA
2 b white bob 10 7
3 c blue drew 50 NA
4 c green drew NA 100
5 d orange collin NA 15

When thinking of efficiency, you should evaluate the data.table package:
library(data.table)
(merge(
setDT(set1, key=names(set1)[1:3]),
setDT(set2, key=names(set2)[1:3]),
all=TRUE,
suffixes=paste0(".set",1:2)
) -> set3)
# label1 label2 name val.set1 val.set2
# 1: a red sam 1 NA
# 2: b white bob 10 7
# 3: c blue drew 100 NA
# 4: c green drew NA 100
# 5: d orange collin NA 15

Since they're in the same format, you could just row-bind them together and then keep only the unique key combinations. Using dplyr:
library(dplyr)
bind_rows(set1, set2) %>% distinct(label1, label2, name)
Note that this returns only the three key columns, not the two val columns. Also make sure that the key columns are not factors; everything should be character or numeric.
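If the goal is the full set3 table with both val columns, a plain full outer join on the three key columns is one way to get there. The following is a minimal base R sketch, not a statement about which approach is fastest; it reuses the set1 and set2 definitions from the question, and the name keys is introduced here only for illustration.

```r
set1 <- data.frame(label1 = c("a", "b", "c"),
                   label2 = c("red", "white", "blue"),
                   name   = c("sam", "bob", "drew"),
                   val    = c(1, 10, 100))
set2 <- data.frame(label1 = c("b", "c", "d"),
                   label2 = c("white", "green", "orange"),
                   name   = c("bob", "drew", "collin"),
                   val    = c(7, 100, 15))

# Columns that together form the primary key (illustrative name)
keys <- c("label1", "label2", "name")

# all = TRUE keeps unmatched rows from both sides (full outer join);
# suffixes renames the clashing val columns to val.set1 / val.set2
set3 <- merge(set1, set2, by = keys, all = TRUE,
              suffixes = c(".set1", ".set2"))
set3
```

Rows that exist in only one of the inputs get NA in the other val column.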


How do I plot data points that are very close together in R?

Here's a sample of the three data frames I'm working with. The full dataset contains 1,087 rows.
Day Length Category
1 1 33.807 Red
2 2 33.909 Red
3 3 34.011 Red
4 4 34.556 Red
5 5 34.789 Red
6 6 35 Red
Day Length Category
1 1 33.737 Blue
2 2 33.898 Blue
3 3 34.211 Blue
4 4 34.657 Blue
5 5 34.714 Blue
6 6 34.912 Blue
Day Length Category
1 1 33.631 Green
2 2 33.777 Green
3 3 34.101 Green
4 4 34.244 Green
5 5 34.590 Green
6 6 34.128 Green
My current code is as follows:
ggplot(data = df, aes(x = Day, y = Length, group = Category)) + geom_line(aes(color = Category, alpha = 1), size = 2)
But this results in three lines that are overlapping. Is there a better solution for this? Again, this dataset is a sample and the full dataset is much larger. So a solution that would work for a dataset of any size would be appreciated!
If you want to focus on the difference, plot the difference:
dd = data.frame(Day = d1$Day, diff = d1$Length - d2$Length)
library(ggplot2)
ggplot(dd, aes(x = Day, y = diff)) +
geom_hline(yintercept = 0, lwd = 1) +
geom_line() +
geom_point() +
labs(title = "Difference (first - second)")
Using this data:
d1 = read.table(text = ' Day Length
1 1 33.807
2 2 33.909
3 3 34.011
4 4 34.556
5 5 34.789
6 6 35', header = T)
d2 = read.table(text = ' Day Length
1 1 33.737
2 2 33.898
3 3 34.211
4 4 34.657
5 5 34.714
6 6 34.912', header = T)
The typical ggplot2 workflow would combine the two data sets into one table, with a column distinguishing the source, which you could then map to the color aesthetic, for instance. You might also add text labels if the distinction in the underlying data is important to show.
library(tidyverse)
tribble(
~Source, ~Day, ~Length,
"A", 1, 33.807,
"A", 2, 33.909,
"A", 3, 34.011,
"A", 4, 34.556,
"A", 5, 34.789,
"A", 6, 35,
"B", 1, 33.737,
"B", 2, 33.898,
"B", 3, 34.211,
"B", 4, 34.657,
"B", 5, 34.714,
"B", 6, 34.912) %>%
ggplot(aes(Day, Length, color = Source)) +
geom_line() +
ggrepel::geom_text_repel(aes(label = Length),
direction = "y", box.padding = 0.5)
Or, if your data is in two tables like the d1 and d2 in the answer from @gregor-thomas, you could use something like this to combine them:
bind_rows("A" = d1, "B" = d2, .id = "Source") %>%
ggplot(aes(Day, Length, color = Source)) +
geom_line() +
ggrepel::geom_text_repel(aes(label = Length),
direction = "y", box.padding = 0.5)
Edit:
If it's a visual readability issue, you might try variations like ggshadow::geom_shadowline to highlight overlaps:
devtools::install_github("marcmenem/ggshadow")
df %>%
ggplot(aes(Day, Length, color = Category)) +
ggshadow::geom_shadowline(size = 2)
Using this data:
df = read.table(text =
'Day Length Category
1 33.807 Red
2 33.909 Red
3 34.011 Red
4 34.556 Red
5 34.789 Red
6 35 Red
1 33.737 Blue
2 33.898 Blue
3 34.211 Blue
4 34.657 Blue
5 34.714 Blue
6 34.912 Blue
1 33.631 Green
2 33.777 Green
3 34.101 Green
4 34.244 Green
5 34.590 Green
6 34.128 Green', header = T)
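With more than two series, one option in the same spirit as plotting the difference is to plot each category's deviation from the per-day mean, which works for any number of categories. This is only a rough sketch under that assumption: the column name dev is made up here, and the plotting step runs only if ggplot2 is installed.

```r
df <- read.table(text = 'Day Length Category
1 33.807 Red
2 33.909 Red
3 34.011 Red
4 34.556 Red
5 34.789 Red
6 35 Red
1 33.737 Blue
2 33.898 Blue
3 34.211 Blue
4 34.657 Blue
5 34.714 Blue
6 34.912 Blue
1 33.631 Green
2 33.777 Green
3 34.101 Green
4 34.244 Green
5 34.590 Green
6 34.128 Green', header = TRUE)

# Deviation of each series from the mean Length on the same Day
# (ave() returns the group mean repeated for every row of the group)
df$dev <- df$Length - ave(df$Length, df$Day)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = Day, y = dev, color = Category)) +
    geom_hline(yintercept = 0) +
    geom_line() +
    geom_point() +
    labs(y = "Length minus daily mean")
  print(p)
}
```

Because the shared trend is subtracted out, small gaps between the lines take up the whole y-axis instead of being squeezed together.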

R: outer-merge two dataframes with unequal columns

I'm new to coding and am struggling a bit with this merge. I have two dataframes:
> a1
a b c
1 1 apple x
2 2 bees a
3 3 candy a
4 4 dice s
5 5 donut d
> a2
a b c d
1 1 apple x a
2 2 bees a d
3 6 coffee r s
I would like to join these two dataframes by a, b, and c. I want to get rid of the duplicate rows where a, b, c are the same, but keep the unique rows in both datasets. In the case where a unique row in a2 is kept, I would also want d to be shown. So the result would be something like the following:
> a3
a b c d
1 1 apple x a
2 2 bees a d
3 3 candy a NA
4 4 dice s NA
5 5 donut d NA
6 6 coffee r s
You can use a full_join from tidyverse:
library(tidyverse)
full_join(a1, a2, by = c("a", "b", "c")) %>%
distinct()
Output
a b c d
1 1 apple x a
2 2 bees a d
3 3 candy a <NA>
4 4 dice s <NA>
5 5 donut d <NA>
6 6 coffee r s
Data
a1 <- structure(list(a = 1:5, b = c("apple", "bees", "candy", "dice",
"donut"), c = c("x", "a", "a", "s", "d")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
a2 <- structure(list(a = c(1L, 2L, 6L), b = c("apple", "bees", "coffee"
), c = c("x", "a", "r"), d = c("a", "d", "s")), class = "data.frame", row.names = c("1",
"2", "3"))
Hello there & here you go:
a3 <- merge(a1, a2, all=TRUE) # this merges your data preserving all vals from all cols/rows (if missing - adds NA)
a3 <- a3[order(a3[,"d"], decreasing=TRUE),] # this sorts your merged df
a3 <- a3[!duplicated(a3[,"b"]),] # this removes duplicate values
I edited my answer after looking more carefully at your desired output. You want to preserve the "d" value even when a row is duplicated, so I first order the df by the "d" column in decreasing order (so the NAs end up at the bottom of the df). Then I drop the duplicates in column "b"; since duplicated() keeps the first occurrence, the "apple" row that came from a2 is preserved in the output. You can also turn this into a one-liner using the pipe operator.
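Since merge() joins on all shared column names by default (here a, b and c), the full outer merge on its own may already give the desired a3, without the ordering and de-duplication steps. A small base R check of that, with a1 and a2 rebuilt from the question's data:

```r
a1 <- data.frame(a = 1:5,
                 b = c("apple", "bees", "candy", "dice", "donut"),
                 c = c("x", "a", "a", "s", "d"))
a2 <- data.frame(a = c(1L, 2L, 6L),
                 b = c("apple", "bees", "coffee"),
                 c = c("x", "a", "r"),
                 d = c("a", "d", "s"))

# by defaults to the columns the two frames share: a, b, c
a3 <- merge(a1, a2, all = TRUE)
a3

# No key combination appears more than once
anyDuplicated(a3[c("a", "b", "c")]) == 0
```

De-duplicating on column "b" alone would also drop rows that merely share a value in b but differ in a or c, so checking the full key is the safer test.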

Transforming a dataframe in r to apply pivot table

I have a data frame like below:
Red Green Black
John A B C
Sean A D C
Tim B C C
How can I transform it to below form to apply a pivot table (or if it can be done directly in r without transforming data):
Names Code Type
John Red A
John Green B
John Black C
Sean Red A
Sean Green D
Sean Black C
Tim Red B
Tim Green C
Tim Black C
So then my ultimate goal is to count the types as below by a pivot table on the transformed dataframe:
Count of Code for each type:
Row Labels    A  B  C  D  Grand Total
John          1  1  1     3
Sean          1     1  1  3
Tim              1  2     3
Grand Total   2  2  4  1  9
Reading similar topics did not help much.
Thanks in advance!
Regards
Using a literal dump from your first matrix-like frame above:
dat <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), class = "data.frame", row.names = c("John",
"Sean", "Tim"))
I can do this:
library(dplyr)
library(tidyr)
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names)
# Names Code Type
# 1 John Red A
# 2 Sean Red A
# 3 Tim Red B
# 4 John Green B
# 5 Sean Green D
# 6 Tim Green C
# 7 John Black C
# 8 Sean Black C
# 9 Tim Black C
We can extend that to get your next goal:
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names) %>%
xtabs(~ Names + Type, data = .)
# Type
# Names A B C D
# John 1 1 1 0
# Sean 1 0 1 1
# Tim 0 1 2 0
which then just needs marginals:
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names) %>%
xtabs(~ Names + Type, data = .) %>%
addmargins()
# Type
# Names A B C D Sum
# John 1 1 1 0 3
# Sean 1 0 1 1 3
# Tim 0 1 2 0 3
# Sum 2 2 4 1 9
You can use reshape(). I'm not sure about your data structure, if there is a column with names or if they are row names. I've added both versions.
reshape(dat1, idvar="Names",
varying=2:4,
v.names="Type", direction="long",
timevar="Code", times=c("red", "green", "black"),
new.row.names=1:9)
reshape(transform(dat2, Names=rownames(dat2)), idvar="Names",
varying=1:3,
v.names="Type", direction="long",
timevar="Code", times=c("red", "green", "black"),
new.row.names=1:9)
# Names Code Type
# 1 John red A
# 2 Sean red A
# 3 Tim red B
# 4 John green B
# 5 Sean green D
# 6 Tim green C
# 7 John black C
# 8 Sean black C
# 9 Tim black C
To get kind of a raw version you could do:
res <- reshape(transform(dat2, Names=rownames(dat2)), idvar="Names",
varying=1:3,
v.names="Type", direction="long",
timevar="Code")
res
# Names Code Type
# John.1 John 1 A
# Sean.1 Sean 1 A
# Tim.1 Tim 1 B
# John.2 John 2 B
# Sean.2 Sean 2 D
# Tim.2 Tim 2 C
# John.3 John 3 C
# Sean.3 Sean 3 C
# Tim.3 Tim 3 C
After that you may assign labels at will to "Code" column by transforming to factor like so:
res$Code <- factor(res$Code, labels=c("red", "green", "black"))
Data
dat1 <- structure(list(Names = c("John", "Sean", "Tim"), Red = c("A",
"A", "B"), Green = c("B", "D", "C"), Black = c("C", "C", "C")), row.names = c(NA,
-3L), class = "data.frame")
dat2 <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), row.names = c("John", "Sean", "Tim"
), class = "data.frame")
What you aim to do is (1) create a contingency table and then (2) compute the sums of the table entries for both rows and columns.
Step 1: Create a contingency table
I first pivoted the data using pivot_longer() rather than gather() because it's more intuitive. Then I applied table() to the two variables of interest.
# Toy example
library(dplyr) # provides %>%
df <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), class = "data.frame", row.names = c("John",
"Sean", "Tim"))
# Pivot the data (pivot_longer() is exported by tidyr, not by tidyverse)
long_df <- tibble::rownames_to_column(df, var = "Names") %>%
  tidyr::pivot_longer(cols = -Names,
                      names_to = "Code",
                      values_to = "Type")
# Create a contingency table of names by type
df_table <- table(long_df$Names, long_df$Type)
Step 2: Compute the sum of entries for both rows and columns.
Again, I only used a base R function, margin.table(). This approach also lets you save the row and column sums for further analysis.
# Row totals (margin = 1 indicates rows)
df_table %>%
margin.table(margin = 1)
# Column totals (margin = 2 indicates columns)
df_table %>%
margin.table(margin = 2)
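For reference, the reshape and the pivot-table-style count with totals can also be sketched in base R alone. This is an illustration rather than a replacement for the answers above; the object names long and tab are made up here.

```r
dat <- data.frame(Red   = c("A", "A", "B"),
                  Green = c("B", "D", "C"),
                  Black = c("C", "C", "C"),
                  row.names = c("John", "Sean", "Tim"))

# Long form: unlist() walks the frame column by column, so the
# names repeat as a block and each column name repeats nrow times
long <- data.frame(Names = rep(rownames(dat), times = ncol(dat)),
                   Code  = rep(names(dat), each = nrow(dat)),
                   Type  = unlist(dat, use.names = FALSE))

# Contingency table of Names by Type, with row and column totals
tab <- addmargins(table(Names = long$Names, Type = long$Type))
tab
```

addmargins() labels the totals "Sum"; the bottom-right cell is the grand total.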

Restructuring data from long to wide by removing characters

Here is a sample of my data
code group type outcome
11 A red M*P
11 N orange N*P
11 Z red R
12 AB A blue Z*P
12 AN B green Q*P
12 AA A gray AB
which can be created by:
df <- data.frame(
code = c(rep(11,3), rep(12,3)),
group = c("A", "N", "Z", "AB A", "AN B", "AA A"),
type = c("red", "orange", "red", "blue", "green", "gray"),
outcome = c("M*P", "N*P", "R", "Z*P", "Q*P", "AB"),
stringsAsFactors = FALSE
)
I want to get the following table
code group1 group2 group3 type1 type2 type3 outcome
11 A N Z red orange red MNR
12 AB A AN B AA A blue green gray ZQAB
I have used the following code, but it does not work. I want to remove Ps in outcome. Thanks for your help.
dcast(df, formula= code +group ~ type, value.var = 'outcome')
Using data.table to hit your expected output:
library(data.table)
setDT(df)
# Clean out the Ps beforehand
df[, outcome := gsub("*P", "", outcome, fixed = TRUE)]
# dcast, but let's leave the outcome for later (easier)
wdf <- dcast(df, code ~ rowid(code), value.var = c('group', 'type'))
# Now handle the outcome separately by code and merge
merge(wdf, df[, .(outcome = paste(outcome, collapse = "")), code])
code group_1 group_2 group_3 type_1 type_2 type_3 outcome
1: 11 A N Z red orange red MNR
2: 12 AB A AN B AA A blue green gray ZQAB
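The same three steps (strip the "*P", spread group and type by position within each code, then paste the outcomes together) can be sketched without data.table. This is a base R approximation under the assumption that row order within each code defines the numbering; idx, wide, out and res are names invented for this example.

```r
df <- data.frame(
  code = c(rep(11, 3), rep(12, 3)),
  group = c("A", "N", "Z", "AB A", "AN B", "AA A"),
  type = c("red", "orange", "red", "blue", "green", "gray"),
  outcome = c("M*P", "N*P", "R", "Z*P", "Q*P", "AB"),
  stringsAsFactors = FALSE
)

# Remove the literal "*P" (fixed = TRUE turns off regex handling)
df$outcome <- gsub("*P", "", df$outcome, fixed = TRUE)

# Running position of each row within its code: 1, 2, 3, ...
df$idx <- ave(seq_along(df$code), df$code, FUN = seq_along)

# Spread group and type into group1..group3 and type1..type3
wide <- reshape(df[c("code", "idx", "group", "type")],
                idvar = "code", timevar = "idx",
                direction = "wide", sep = "")

# Concatenate the cleaned outcomes per code and attach them
out <- aggregate(outcome ~ code, data = df, FUN = paste, collapse = "")
res <- merge(wide, out, by = "code")
res
```

reshape() interleaves the columns (group1, type1, group2, ...), so reorder them afterwards if the exact layout from the question matters.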

r programming: align two sequences of words

I want to align two datasets that mostly intersect on one column -- but each dataset is missing some rows. For example:
df1 <- data.frame(word = c("my", "dog", "ran", "with", "your", "dog"),
freq = c(5, 2, 2, 6, 5, 2))
df2 <- data.frame(word = c("my", "brown", "dog", "ran", "your", "dog"),
pos = c("a", "b", "c", "d", "a", "c"))
What I want as output is to have gaps inserted wherever there's a missing item. Thus in the output, the new form of df1 will have NAs where df1 was missing a word match that was in df2, and the new form of df2 will have NAs where df2 was missing a word-instance that was in df1.
As in my example, the sequence matters and elements do repeat. (so this isn't a generic "merge" situation.) I suspect DTW could figure in to the solution but I'm not sure. For present purposes it's fair to stipulate that only exact matches do match.
For the above case the desired output would be a data frame with these columns:
$word1 my NA dog ran with your dog
$freq 5 NA 2 2 6 5 2
$word2 my brown dog ran NA your dog
$pos a b c d NA a c
Thus, the sequence in each original data frame is maintained; nothing is deleted; word tokens remain tokens (it's a corpus, not a dictionary); all that's really happened is spaces (NAs) have been inserted where data are missing.
df1$count = ave(seq_along(df1$word), df1$word, FUN = seq_along)
df2$count = ave(seq_along(df2$word), df2$word, FUN = seq_along)
df1$merge = paste(df1$count, df1$word)
df2$merge = paste(df2$count, df2$word)
output = merge(x = df1, y = df2, by = "merge", all.x = TRUE, all.y = TRUE)
output[c(2, 3, 5, 6)]
# word.x freq word.y pos
#1 <NA> NA brown b
#2 dog 2 dog c
#3 my 5 my a
#4 ran 2 ran d
#5 with 6 <NA> <NA>
#6 your 5 your a
#7 dog 2 dog c
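Note that the merge above returns rows sorted by the merge key rather than in the original word order. If the sequence has to be preserved, one possible approach (an assumption here, not what the answer above does) is a longest-common-subsequence alignment with exact matching. The helper align_words below is hypothetical and written only for this sketch; it returns row indices into each input, with NA marking a gap.

```r
# Align two word sequences via longest common subsequence (LCS).
# Returns indices into x and y; NA means "no counterpart here".
align_words <- function(x, y) {
  x <- as.character(x); y <- as.character(y)
  n <- length(x); m <- length(y)
  # L[i + 1, j + 1] holds the LCS length of x[1:i] and y[1:j]
  L <- matrix(0L, n + 1L, m + 1L)
  for (i in seq_len(n)) for (j in seq_len(m)) {
    L[i + 1L, j + 1L] <- if (x[i] == y[j]) L[i, j] + 1L
                         else max(L[i, j + 1L], L[i + 1L, j])
  }
  # Walk back from the end, collecting matches and gaps
  i <- n; j <- m
  ix <- integer(0); iy <- integer(0)
  while (i > 0L || j > 0L) {
    if (i > 0L && j > 0L && x[i] == y[j]) {
      ix <- c(i, ix); iy <- c(j, iy); i <- i - 1L; j <- j - 1L
    } else if (j > 0L && (i == 0L || L[i + 1L, j] >= L[i, j + 1L])) {
      ix <- c(NA, ix); iy <- c(j, iy); j <- j - 1L   # gap in x
    } else {
      ix <- c(i, ix); iy <- c(NA, iy); i <- i - 1L   # gap in y
    }
  }
  data.frame(i1 = ix, i2 = iy)
}

w1 <- c("my", "dog", "ran", "with", "your", "dog")
w2 <- c("my", "brown", "dog", "ran", "your", "dog")
idx <- align_words(w1, w2)
data.frame(word1 = w1[idx$i1], word2 = w2[idx$i2])
```

Indexing the original data frames with these vectors, e.g. df1[idx$i1, ] and df2[idx$i2, ], then yields NA rows wherever a word is missing on one side.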
