R: outer-merge two dataframes with unequal columns

I'm new to coding and am struggling a bit with this merge. I have two dataframes:
> a1
a b c
1 1 apple x
2 2 bees a
3 3 candy a
4 4 dice s
5 5 donut d
> a2
a b c d
1 1 apple x a
2 2 bees a d
3 6 coffee r s
I would like to join these two dataframes by a, b, and c. I want to get rid of the duplicate rows where a, b, c are the same, but keep the unique rows in both datasets. In the case where a unique row in a2 is kept, I would also want d to be shown. So the result would be something like the following:
> a3
a b c d
1 1 apple x a
2 2 bees a d
3 3 candy a NA
4 4 dice s NA
5 5 donut d NA
6 6 coffee r s

You can use a full_join from tidyverse:
library(tidyverse)
full_join(a1, a2, by = c("a", "b", "c")) %>%
distinct()
Output
a b c d
1 1 apple x a
2 2 bees a d
3 3 candy a <NA>
4 4 dice s <NA>
5 5 donut d <NA>
6 6 coffee r s
Data
a1 <- structure(list(a = 1:5, b = c("apple", "bees", "candy", "dice",
"donut"), c = c("x", "a", "a", "s", "d")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
a2 <- structure(list(a = c(1L, 2L, 6L), b = c("apple", "bees", "coffee"
), c = c("x", "a", "r"), d = c("a", "d", "s")), class = "data.frame", row.names = c("1",
"2", "3"))

Hello there & here you go:
a3 <- merge(a1, a2, all=TRUE) # this merges your data preserving all vals from all cols/rows (if missing - adds NA)
a3 <- a3[order(a3[,"d"], decreasing=TRUE),] # this sorts your merged df
a3 <- a3[!duplicated(a3[,"b"]),] # this removes duplicate values
I edited my answer after looking more carefully at your desired output. You want to preserve the value in column "d" even when there is a duplicate, so I first suggest ordering the data frame by column "d" in decreasing order (so the NAs end up at the bottom). Then I suggest excluding the duplicates in column "b"; since duplicated() flags every occurrence after the first, the "apple" row that originally came from a2 is preserved in the output. You can also turn this into a one-liner using the pipe operator.
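As far as I can tell, merge() with all = TRUE already matches on every column the two data frames share (a, b, and c here), so for this particular data the merge alone may be enough, with the ordering and de-duplication steps acting as a safety net. A minimal sketch, assuming the data frames from the question (rebuilt here so the block runs on its own):

```r
# Data frames from the question, rebuilt so this block is self-contained
a1 <- data.frame(a = 1:5,
                 b = c("apple", "bees", "candy", "dice", "donut"),
                 c = c("x", "a", "a", "s", "d"),
                 stringsAsFactors = FALSE)
a2 <- data.frame(a = c(1L, 2L, 6L),
                 b = c("apple", "bees", "coffee"),
                 c = c("x", "a", "r"),
                 d = c("a", "d", "s"),
                 stringsAsFactors = FALSE)

# Full outer join on the three shared columns; rows found only in a1 get NA in d
a3 <- merge(a1, a2, by = c("a", "b", "c"), all = TRUE)
a3
```

Naming the key columns explicitly in by is optional here, but it guards against a silent change in behaviour if either data frame later gains another shared column.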

Related

Convert all empty & fields marked with "N/A" as NA in R

I am new to Machine Learning & R, so my question is a pretty basic one:
I have imported a dataset and performed some modifications and stored the final output in a dataframe named df_final.
Now I would like to replace all the empty fields and fields containing "N/A" or "n/a" with NA, so that I can use the built-in NA functions in R.
Any help in this context would be highly appreciated.
Cheers!
Vivek
I agree that the problem is best solved at read-in, by setting na.strings = c("", "N/A", "n/a") in read.table, as suggested by @Darren Tsai. If that's no longer an option because you've processed the data already and, as I suspect, you do not want to keep only complete cases, as suggested by @Rui Barradas, then the issue can be addressed this way:
DATA:
df_final <- data.frame(v1 = c(1, "N/A", 2, "n/a", "", 3),
v2 = c("a", "", "b", "c", "d", "N/A"))
df_final
v1 v2
1 1 a
2 N/A
3 2 b
4 n/a c
5 d
6 3 N/A
SOLUTION:
To introduce NA into empty fields, you can do:
df_final[df_final==""] <- NA
df_final
v1 v2
1 1 a
2 N/A <NA>
3 2 b
4 n/a c
5 <NA> d
6 3 N/A
To change the other values into NA, you can use lapply and a function:
df_final[,1:2] <- lapply(df_final[,1:2], function(x) gsub("N/A|n/a", NA, x))
df_final
v1 v2
1 1 a
2 <NA> <NA>
3 2 b
4 <NA> c
5 <NA> d
6 3 <NA>
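For the read-in route mentioned at the start of this answer, here is a rough sketch of what na.strings does. The inline csv_text is a hypothetical stand-in for the real file, and read.csv() is used as the comma-separated wrapper around read.table():

```r
# Hypothetical inline CSV standing in for the file that was imported
csv_text <- 'v1,v2\n1,a\nN/A,\n2,b\nn/a,c\n,d\n3,N/A'

# Every string listed in na.strings is turned into NA while reading
df_final <- read.csv(text = csv_text,
                     na.strings = c("", "N/A", "n/a"),
                     stringsAsFactors = FALSE)
df_final
```

Because the placeholders never reach the data frame, v1 is read as a numeric column instead of character.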
This is a two-step solution:
1. Replace the bad values with real NA values.
2. Keep the complete.cases.
In base R:
is.na(df1) <- sapply(df1, function(x) x %in% c("", "N/A", "n/a"))
df_final <- df1[complete.cases(df1), , drop = FALSE]
df_final
# x y
#1 a u
#3 d v
Data creation code.
df1 <- data.frame(x = c("a", "N/A", "d", "n/a", ""),
y = c("u", "", "v", "x", "y"))
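If the incomplete rows should be kept rather than dropped, the first step can be used on its own. A minimal sketch, assuming the same df1 as above; the helper vector bad is a name introduced here for illustration:

```r
df1 <- data.frame(x = c("a", "N/A", "d", "n/a", ""),
                  y = c("u", "", "v", "x", "y"),
                  stringsAsFactors = FALSE)

# Placeholder strings that should count as missing (illustrative name)
bad <- c("", "N/A", "n/a")

# df1[] keeps the data frame structure while each column is rewritten
df1[] <- lapply(df1, function(col) replace(col, col %in% bad, NA))
df1
```

Unlike gsub() with a pattern, %in% only matches whole cell values, so a cell such as "n/a (estimated)" would be left untouched.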

Renaming the headers of every data frame in a list

I have a list containing a number of data frames, all with the same number of columns.
E.g, for a list df_list with two data frames, df1 and df2:
>df_list
df1
a b c
1 1 1
2 2 2
3 3 3
df2
a b c
3 2 1
3 2 1
3 2 1
I want to rename the headers of every data frame to new_headings <- c("A", "B", "C").
I constructed a for loop:
for (i in 1:length(list)) {
names(list[[i]]) <- new_headings
}
However, this doesn't work: the headings remain as they were. If I do it individually instead of in a loop, it works fine; e.g., names(list[[1]]) <- new_headings changes the headings appropriately.
My actual list is very long with many data frames. Can anyone explain why this isn't working or what other approach I can use? Thank you.
We can use Map with setNames
df_listNew <- Map(setNames, df_list, list(new_headings))
Or using lapply
lapply(df_list, setNames, new_headings)
#$df1
# A B C
#1 1 1 1
#2 2 2 2
#3 3 3 3
#$df2
# A B C
#1 3 2 1
#2 3 2 1
#3 3 2 1
data
df_list <- list(df1 = structure(list(a = 1:3, b = 1:3, c = 1:3),
class = "data.frame", row.names = c(NA,
-3L)), df2 = structure(list(a = c(3, 3, 3), b = c(2, 2, 2), c = c(1,
1, 1)), class = "data.frame", row.names = c(NA, -3L)))
You can use two for loops
a<-c(1,2,3)
b<-c(1,2,3)
c<-c(1,2,3)
df1<-as.data.frame(cbind(a,b,c))
a<-c(3,2,1)
b<-c(3,2,1)
c<-c(3,2,1)
df2<-as.data.frame(cbind(a,b,c))
df_list<-list(df1,df2)
new_headings <- c("A", "B", "C")
for (i in 1:length(df_list)) {
for (j in 1:length(df_list[[i]])) {
colnames(df_list[[i]])[j] <- new_headings[j]
}
}
df_list
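The inner loop is not strictly needed, since names<- can assign the whole vector of headings at once. A minimal sketch of a single loop, assuming the list is called df_list as in the answers above (calling it list, as in the question, would shadow the base function of that name):

```r
df_list <- list(df1 = data.frame(a = 1:3, b = 1:3, c = 1:3),
                df2 = data.frame(a = c(3, 3, 3), b = c(2, 2, 2), c = c(1, 1, 1)))
new_headings <- c("A", "B", "C")

# seq_along() is safer than 1:length() because it handles an empty list
for (i in seq_along(df_list)) {
  names(df_list[[i]]) <- new_headings
}
df_list
```

Note that the loop modifies df_list in place, whereas lapply() and Map() return a new list that has to be assigned back.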

Transforming a dataframe in r to apply pivot table

I have a data frame like below:
Red Green Black
John A B C
Sean A D C
Tim B C C
How can I transform it to below form to apply a pivot table (or if it can be done directly in r without transforming data):
Names Code Type
John Red A
John Green B
John Black C
Sean Red A
Sean Green D
Sean Black C
Tim Red B
Tim Green C
Tim Black C
So then my ultimate goal is to count the types as below by a pivot table on the transformed dataframe:
Count of Code for each type:
Row Labels    A  B  C  D  Grand Total
John          1  1  1     3
Sean          1     1  1  3
Tim              1  2     3
Grand Total   2  2  4  1  9
Reading similar topics did not help much.
Thanks in advance!
Regards
Using a literal dump from your first matrix-like frame above:
dat <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), class = "data.frame", row.names = c("John",
"Sean", "Tim"))
I can do this:
library(dplyr)
library(tidyr)
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names)
# Names Code Type
# 1 John Red A
# 2 Sean Red A
# 3 Tim Red B
# 4 John Green B
# 5 Sean Green D
# 6 Tim Green C
# 7 John Black C
# 8 Sean Black C
# 9 Tim Black C
We can extend that to get your next goal:
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names) %>%
xtabs(~ Names + Type, data = .)
# Type
# Names A B C D
# John 1 1 1 0
# Sean 1 0 1 1
# Tim 0 1 2 0
which then just needs marginals:
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names) %>%
xtabs(~ Names + Type, data = .) %>%
addmargins()
# Type
# Names A B C D Sum
# John 1 1 1 0 3
# Sean 1 0 1 1 3
# Tim 0 1 2 0 3
# Sum 2 2 4 1 9
You can use reshape(). I'm not sure about your data structure, if there is a column with names or if they are row names. I've added both versions.
reshape(dat1, idvar="Names",
varying=2:4,
v.names="Type", direction="long",
timevar="Code", times=c("red", "green", "black"),
new.row.names=1:9)
reshape(transform(dat2, Names=rownames(dat2)), idvar="Names",
varying=1:3,
v.names="Type", direction="long",
timevar="Code", times=c("red", "green", "black"),
new.row.names=1:9)
# Names Code Type
# 1 John red A
# 2 Sean red A
# 3 Tim red B
# 4 John green B
# 5 Sean green D
# 6 Tim green C
# 7 John black C
# 8 Sean black C
# 9 Tim black C
To get kind of a raw version you could do:
res <- reshape(transform(dat2, Names=rownames(dat2)), idvar="Names",
varying=1:3,
v.names="Type", direction="long",
timevar="Code")
res
# Names Code Type
# John.1 John 1 A
# Sean.1 Sean 1 A
# Tim.1 Tim 1 B
# John.2 John 2 B
# Sean.2 Sean 2 D
# Tim.2 Tim 2 C
# John.3 John 3 C
# Sean.3 Sean 3 C
# Tim.3 Tim 3 C
After that you may assign labels at will to "Code" column by transforming to factor like so:
res$Code <- factor(res$Code, labels=c("red", "green", "black"))
Data
dat1 <- structure(list(Names = c("John", "Sean", "Tim"), Red = c("A",
"A", "B"), Green = c("B", "D", "C"), Black = c("C", "C", "C")), row.names = c(NA,
-3L), class = "data.frame")
dat2 <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), row.names = c("John", "Sean", "Tim"
), class = "data.frame")
What you aim to do is (1) create a contingency table and then (2) compute the sum of the table entries for both rows and columns.
Step 1: Create a contingency table
I first pivoted the data using pivot_longer() rather than gather() because it's more intuitive. Then, apply table() to the two variables of your interest.
# Toy example
df <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), class = "data.frame", row.names = c("John",
"Sean", "Tim"))
# Pivot the data (pivot_longer() lives in tidyr; dplyr supplies the %>% pipe)
library(dplyr)
long_df <- tibble::rownames_to_column(df, var = "Names") %>%
  tidyr::pivot_longer(cols = -Names,
                      names_to = "Code",
                      values_to = "Type")
# Create a contingency table of names against types
df_table <- table(long_df$Names, long_df$Type)
Step 2: Compute the sum of entries for both rows and columns.
Again, I only used a base R function margin.table(). Using this approach also allows you to save the sum of the row and column entries for further analysis.
# Row totals (margin = 1 indicates rows)
df_table %>%
  margin.table(margin = 1)
# Column totals (margin = 2 indicates columns)
df_table %>%
  margin.table(margin = 2)
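The question also asks whether the counts can be obtained directly, without reshaping first. One possible base R sketch, assuming the names are stored as row names as in the toy data above; m and counts are names introduced here for illustration:

```r
dat <- data.frame(Red = c("A", "A", "B"),
                  Green = c("B", "D", "C"),
                  Black = c("C", "C", "C"),
                  row.names = c("John", "Sean", "Tim"),
                  stringsAsFactors = FALSE)

# Work on a character matrix so every cell can be paired with its row name
m <- as.matrix(dat)

# row(m) gives the row index of every cell, in the same column-major
# order that as.vector(m) uses for the cell values
counts <- table(Names = rownames(m)[as.vector(row(m))],
                Type = as.vector(m))

# Append row and column totals, labelled "Sum"
addmargins(counts)
```

This skips the long format entirely, at the cost of losing the Code (colour) column, which the final pivot table does not use anyway.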

r programming: align two sequences of words

I want to align two datasets that mostly intersect on one column -- but each dataset is missing some rows. For example:
df1 <- data.frame(word = c("my", "dog", "ran", "with", "your", "dog"),
freq = c(5, 2, 2, 6, 5, 10))
df2 <- data.frame(word = c("my", "brown", "dog", "ran", "your", "dog"),
pos = c("a", "b", "c", "d", "a", "e"))
What I want as output is to have gaps inserted wherever there's a missing item. Thus in the output, the new form of df1 will have NAs where df1 was missing a word match that was in df2, and the new form of df2 will have NAs where df2 was missing a word-instance that was in df1.
As in my example, the sequence matters and elements do repeat. (so this isn't a generic "merge" situation.) I suspect DTW could figure in to the solution but I'm not sure. For present purposes it's fair to stipulate that only exact matches do match.
For the above case the desired output would be a data frame with these columns:
$word1 my NA dog ran with your dog
$freq 5 NA 2 2 6 5 10
$word2 my brown dog ran NA your dog
$pos a b c d NA a e
Thus, the sequence in each original data frame is maintained; nothing is deleted; word tokens remain tokens (it's a corpus, not a dictionary); all that's really happened is spaces (NAs) have been inserted where data are missing.
df1$count = ave(seq_along(df1$word), df1$word, FUN = seq_along)
df2$count = ave(seq_along(df2$word), df2$word, FUN = seq_along)
df1$merge = paste(df1$count, df1$word)
df2$merge = paste(df2$count, df2$word)
output = merge(x = df1, y = df2, by = "merge", all.x = TRUE, all.y = TRUE)
output[c(2, 3, 5, 6)]
# word.x freq word.y pos
#1 <NA> NA brown b
#2 dog 2 dog c
#3 my 5 my a
#4 ran 2 ran d
#5 with 6 <NA> <NA>
#6 your 5 your a
#7 dog 10 dog e
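The key step above is the occurrence counter: ave() with seq_along numbers each repeat of a word, so the second "dog" gets its own merge key. A small sketch on the word column alone; the names words, occurrence, and key are illustrative:

```r
words <- c("my", "dog", "ran", "with", "your", "dog")

# Within each group of identical words, number the occurrences 1, 2, ...
occurrence <- ave(seq_along(words), words, FUN = seq_along)

# Combining the counter with the word makes repeated tokens distinguishable
key <- paste(occurrence, words)
key
```

Note that merge() sorts its result by this key by default, which is why the output above is not in the original word order.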

subset a dataframe based on sum of a column

I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the cumulative sum of the column value, i.e., starting from the first row, I want to extract rows until the running total of value exceeds 0.6. My desired output would be:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[,5]>=0.6] but obviously colSums is expecting an array
Thanks in advance
Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362
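Neither subset above reproduces the desired output, because the row that pushes the total past 0.6 has to be included as well. One way to sketch that is to test the running total before each row is added; keep and result are names introduced here for illustration:

```r
df <- data.frame(name = letters[1:9],
                 value = c(0.20019421, 0.17996454, 0.1425701, 0.1425701,
                           0.11258865, 0.0722897, 0.05673759, 0.05319149,
                           0.03989362),
                 stringsAsFactors = FALSE)

# cumsum() minus the row's own value is the total accumulated *before* that row;
# a row is kept while that earlier total is still below the threshold
keep <- cumsum(df$value) - df$value < 0.6
result <- df[keep, ]
result
```

If the threshold is never reached, this version simply keeps every row rather than producing an NA index.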
