Add two R data frames of different sizes [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
If two data frames are
symbol wgt
1 A 2
2 C 4
3 D 6
symbol wgt
1 A 20
2 D 10
how can I add them so that missing observations for a "symbol" in either data frame are treated as zero, giving
symbol wgt
1 A 22
2 C 4
3 D 16

You can join the two data frames by symbol, replace NA with 0, and add the two weights. This assumes every symbol in df2 also appears in df1, as in the example.
library(dplyr)
df1 %>%
  left_join(df2, by = 'symbol') %>%
  mutate(wgt.y = replace(wgt.y, is.na(wgt.y), 0),
         wgt = wgt.x + wgt.y) %>%
  select(-wgt.x, -wgt.y)
# symbol wgt
#1 A 22
#2 C 4
#3 D 16
data
df1 <- structure(list(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L)),
class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(symbol = c("A", "D"), wgt = c(20L, 10L)),
class = "data.frame", row.names = c(NA, -2L))
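If df2 can also contain symbols that df1 lacks, a left join drops them. Here is a minimal base R sketch of the same idea with a full outer join. The extra symbol "E" in df2 is a hypothetical addition for illustration and is not part of the question's data.

```r
df1 <- data.frame(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L),
                  stringsAsFactors = FALSE)
# Hypothetical df2: "E" is added here to show a symbol missing from df1
df2 <- data.frame(symbol = c("A", "D", "E"), wgt = c(20L, 10L, 5L),
                  stringsAsFactors = FALSE)

# all = TRUE keeps symbols from both sides; unmatched weights become NA
joined <- merge(df1, df2, by = "symbol", all = TRUE)

# rowSums with na.rm = TRUE treats the missing weights as zero
total <- data.frame(
  symbol = joined$symbol,
  wgt    = unname(rowSums(joined[c("wgt.x", "wgt.y")], na.rm = TRUE))
)
total
```

With the question's original df2 (no "E"), this should give the same three rows as the left join.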

Try this one-line solution using pipes:
#Data
library(dplyr)
df1 <- structure(list(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L)), class = "data.frame", row.names = c("1",
"2", "3"))
df2 <- structure(list(symbol = c("A", "D"), wgt = c(20L, 10L)), class = "data.frame", row.names = c("1",
"2"))
#Code
df1 %>% left_join(df2, by = 'symbol') %>% mutate(wgt = rowSums(.[-1], na.rm = TRUE)) %>% select(symbol, wgt)
symbol wgt
1 A 22
2 C 4
3 D 16

With data.table and the data provided in the answers of @RonakShah and @Duck, the solution can be a simple aggregation:
library(data.table)
# Convert data.frame to data.table (very fast, since it is done in place)
setDT(df1)
setDT(df2)
# combine both data.frames into one data.frame, group by symbol, apply the sum (NAs are ignored = counted as zero)
rbind(df1,df2)[, sum(wgt, na.rm = TRUE), by = symbol]
# Output
symbol V1
1: A 22
2: C 4
3: D 16
Note: If you want to use base R only (without data.table) you could use aggregate instead:
aggregate(wgt ~ symbol, rbind(df1,df2), sum)
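The stack-and-aggregate approach also extends to more than two data frames. A small sketch, assuming all frames share the columns symbol and wgt; the helper name add_frames and the third data frame df3 are made up for illustration.

```r
# Hypothetical helper: stack any number of frames, then sum wgt per symbol
add_frames <- function(...) {
  stacked <- do.call(rbind, list(...))
  aggregate(wgt ~ symbol, data = stacked, FUN = sum)
}

df1 <- data.frame(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L),
                  stringsAsFactors = FALSE)
df2 <- data.frame(symbol = c("A", "D"), wgt = c(20L, 10L),
                  stringsAsFactors = FALSE)
df3 <- data.frame(symbol = c("C", "E"), wgt = c(1L, 5L),
                  stringsAsFactors = FALSE)  # illustrative only

total <- add_frames(df1, df2, df3)
total
```

A symbol that is absent from a frame simply contributes no rows to the stack, so it counts as zero without any NA handling.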

Related

Rename columns of a dataframe based on another dataframe except columns not in that dataframe in R

Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on another answer, but it returns "E" "Q" "R" "Z" NA; as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
By using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the position of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7, and indices 6 and 7 do not exist in df1, which has only 5 columns.
For the left-hand side (indexing df1), we should swap the arguments of match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This works.
A solution using the rename_with function from the dplyr package.
library(dplyr)
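df3 <- df2 %>%
  filter(Col1 %in% names(df1))
df4 <- df1 %>%
  # all_of() selects by an external character vector; match() keeps each old name paired with its new name
  rename_with(.cols = all_of(df3$Col1), .fn = function(x) df3$Col2[match(x, df3$Col1)])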
df4
# E Q R Z G
# 1 1 2 3 4 5
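The NA produced by match can also be handled directly, by falling back to the original name wherever no replacement is found. A minimal base R sketch using the question's data:

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
df2 <- data.frame(Col1 = c("A", "B", "C", "D", "X"),
                  Col2 = c("E", "Q", "R", "Z", "Y"),
                  stringsAsFactors = FALSE)

# Look up a new name for every column of df1; unmatched columns give NA
new_names <- df2$Col2[match(names(df1), df2$Col1)]

# Keep the old name wherever the lookup returned NA
names(df1) <- ifelse(is.na(new_names), names(df1), new_names)
names(df1)
# [1] "E" "Q" "R" "Z" "G"
```

Because the lookup is done per column name, this does not depend on the row order of df2.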

How to collapse rows by identical values in a column

Good evening,
I have a two-column, tab-separated .txt file, like the following:
number letter
1 a
1 b
2 a
2 b
3 b
I would like to collapse rows where the column "number" has identical value, by creating a comma separated value in the corresponding column "letter".
In other words, this should be the output:
number letter
1 a,b
2 a,b
3 b
I have searched the web but did not find a working solution.
Thank you in advance,
Giuseppe
We can use aggregate in base R
aggregate(letter ~ number, df1, FUN = paste, collapse=",")
Output:
# number letter
#1 1 a,b
#2 2 a,b
#3 3 b
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
  group_by(number) %>%
  summarise(letter = str_c(letter, collapse=","))
data
df1 <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
We can also combine aggregate() with toString:
#Code
newdf <- aggregate(letter~.,df,toString)
Output:
number letter
1 1 a, b
2 2 a, b
3 3 b
Some data:
#Data
df <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
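Since the question starts from a tab-separated text file, here is a sketch of the full round trip in base R. The inline string stands in for the file's contents, and the output path is a temporary file chosen for illustration; with a real file you would pass its path to read.delim instead of text =.

```r
# Stand-in for the contents of the tab-separated input file
raw <- "number\tletter\n1\ta\n1\tb\n2\ta\n2\tb\n3\tb"
df <- read.delim(text = raw, stringsAsFactors = FALSE)

# Collapse the letters of each number into one comma-separated string
collapsed <- aggregate(letter ~ number, data = df, FUN = paste, collapse = ",")
collapsed

# Write the result back out as a tab-separated file (path is illustrative)
out_file <- tempfile(fileext = ".txt")
write.table(collapsed, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```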

Replace a subset of data frame

I have a data frame with some errors
T item V1 V2
1 a 2 .1
2 a 5 .8
1 b 1 .7
2 b 2 .2
I have another data frame with corrections for items concerning V1 only
T item V1
1 a 2
2 a 6
How do I get the final data frame? Should I use merge or rbind? Note: the actual data frames are big.
An option would be a data.table join on 'T' and 'item', assigning to 'V1' the corresponding 'V1' column from the second dataset (i.V1):
library(data.table)
setDT(df1)[df2, V1 := i.V1, on = .(T, item)]
df1
# T item V1 V2
#1: 1 a 2 0.1
#2: 2 a 6 0.8
#3: 1 b 1 0.7
#4: 2 b 2 0.2
data
df1 <- structure(list(T = c(1L, 2L, 1L, 2L), item = c("a", "a", "b",
"b"), V1 = c(2L, 5L, 1L, 2L), V2 = c(0.1, 0.8, 0.7, 0.2)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L)),
class = "data.frame", row.names = c(NA,
-2L))
This should work:
library(dplyr)
df1 %>%
  left_join(df2, by = c("T", "item")) %>%
  mutate(
    V1 = coalesce(as.numeric(V1.y), as.numeric(V1.x))
  ) %>%
  select(-V1.x, -V1.y)
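The same update can be sketched in base R by matching on a combined key, which also leaves the column order of df1 unchanged. This assumes each (T, item) pair occurs at most once in df1 and that the separator "_" does not appear in the key values.

```r
df1 <- data.frame(T = c(1L, 2L, 1L, 2L), item = c("a", "a", "b", "b"),
                  V1 = c(2L, 5L, 1L, 2L), V2 = c(0.1, 0.8, 0.7, 0.2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L),
                  stringsAsFactors = FALSE)

# Build one key per row from the two identifying columns
key1 <- paste(df1$T, df1$item, sep = "_")
key2 <- paste(df2$T, df2$item, sep = "_")

# For each correction, find the row of df1 it applies to (NA if none)
idx <- match(key2, key1)
hit <- !is.na(idx)

# Overwrite only the matched rows of V1
df1$V1[idx[hit]] <- df2$V1[hit]
df1
```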

How to aggregate undirected combinations in R [duplicate]

This question already has answers here:
Create unique identifier from the interchangeable combination of two variables
(2 answers)
Closed 6 years ago.
I have a dataframe of 3 columns
A B 1
A B 1
A C 1
B A 1
I want to aggregate it such that it considers combinations A-B and B-A to be the same, resulting in
A B 3
A C 1
How do I go about this?
Use pmin and pmax on the first two columns and then do the group-by-count:
library(dplyr)
df %>%
  group_by(G1 = pmin(V1, V2), G2 = pmax(V1, V2)) %>%
  summarise(Count = sum(V3))
Source: local data frame [2 x 3]
Groups: G1 [?]
G1 G2 Count
(chr) (chr) (int)
1 A B 3
2 A C 1
Corresponding data.table solution would be:
library(data.table)
setDT(df)
df[, .(Count = sum(V3)), .(G1 = pmin(V1, V2), G2 = pmax(V1, V2))]
G1 G2 Count
1: A B 3
2: A C 1
Data:
df <- structure(list(V1 = c("A", "A", "A", "B"), V2 = c("B", "B", "C",
"A"), V3 = c(1L, 1L, 1L, 1L)), .Names = c("V1", "V2", "V3"), row.names = c(NA,
-4L), class = "data.frame")
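The pmin/pmax trick does not need any package. A base R sketch of the same grouping with aggregate, using the data above (the column name Count from the other answers is replaced here by the original V3):

```r
df <- data.frame(V1 = c("A", "A", "A", "B"),
                 V2 = c("B", "B", "C", "A"),
                 V3 = c(1L, 1L, 1L, 1L),
                 stringsAsFactors = FALSE)

# Order each pair alphabetically so that A-B and B-A get the same key
df$G1 <- pmin(df$V1, df$V2)
df$G2 <- pmax(df$V1, df$V2)

# Sum V3 within each unordered pair
counts <- aggregate(V3 ~ G1 + G2, data = df, FUN = sum)
counts
```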

Reduce() in R over similar variable names causing error

I have 19 nested lists generated from a lapply and split operation.
These lists are in the form:
#list1
Var col1 col2 col3
A 2 3 4
B 3 4 5
#list2
Var col1 col2 col3
A 5 6 7
B 5 4 4
......
#list19
Var col1 col2 col3
A 3 6 7
B 7 4 4
I have been able to merge the lists with
merge.all <- function(x, y) merge(x, y, all=TRUE, by="Var")
out <- Reduce(merge.all, DataList)
I am however getting an error due to the similarity in the names of the other columns.
How can I concatenate the name of the list to the variable names so that I get something like this:
Var list1.col1 list1.col2 list1.col3 .......... list19.col3
A 2 3 4 7
B 3 4 5 .......... 4
I'm really sure somebody will come up with a much, much better solution. However, if you're after a quick and dirty solution, this seems to work.
My plan was to simply change the column names prior to merging.
#Sample Data
df1 <- data.frame(Var = c("A","B"), col1 = c(2,3), col2 = c(3,4), col3 = c(4,5))
df2 <- data.frame(Var = c("A","B"), col1 = c(5,5), col2 = c(6,4), col3 = c(7,5))
df19 <- data.frame(Var = c("A","B"), col1 = c(3,7), col2 = c(6,4), col3 = c(7,4))
mylist <- list(df1, df2, df19)
names(mylist) <- c("df1", "df2", "df19") #just manually naming, presumably your list has names
## Change column names by pasting name of dataframe in list with standard column names. - using ugly mix of `lapply` and a `for` loop:
mycolnames <- colnames(df1)
mycolnames1 <- lapply(names(mylist), function(x) paste0(x, mycolnames))
for (i in seq_along(mylist)) {
  colnames(mylist[[i]]) <- mycolnames1[[i]]
  colnames(mylist[[i]])[1] <- "Var" # put Var back in so you can merge
}
## Merge
merge.all <- function(x, y)
merge(x, y, all=TRUE, by="Var")
out <- Reduce(merge.all, mylist)
out
# Var df1col1 df1col2 df1col3 df2col1 df2col2 df2col3 df19col1 df19col2 df19col3
#1 A 2 3 4 5 6 7 3 6 7
#2 B 3 4 5 5 4 5 7 4 4
There you go - it works but is very ugly.
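The same rename-then-merge idea can be written more compactly with Map, which pairs each data frame with its list name. A minimal sketch, assuming the list is named and the merge key is called Var; the three data frames reuse the values shown in the question.

```r
dlist <- list(
  list1  = data.frame(Var = c("A", "B"), col1 = c(2, 3), col2 = c(3, 4), col3 = c(4, 5)),
  list2  = data.frame(Var = c("A", "B"), col1 = c(5, 5), col2 = c(6, 4), col3 = c(7, 4)),
  list19 = data.frame(Var = c("A", "B"), col1 = c(3, 7), col2 = c(6, 4), col3 = c(7, 4))
)

# Prefix every non-key column with the name of its list element
prefixed <- Map(function(d, nm) {
  other <- names(d) != "Var"
  names(d)[other] <- paste(nm, names(d)[other], sep = ".")
  d
}, dlist, names(dlist))

# Column names are now unique, so the repeated merge no longer clashes
out <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "Var"), prefixed)
names(out)
```

This yields names of the form list1.col1, ..., list19.col3, which is the layout asked for in the question.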
To make the column names unique, you could use a function that renames every column other than the merging variable to a unique name.
resetNames <- function(x, byvar = "Var") {
  asrl <- as.relistable(lapply(x, names))
  allnm <- names(unlist(x, recursive = FALSE))
  rpl <- replace(allnm, unlist(asrl) %in% byvar, byvar)
  Map(setNames, x, relist(rpl, asrl))
}
Reduce(merge.all, resetNames(dlist))
# Var list1.col1 list1.col2 list1.col3 list2.col1 list2.col2 list2.col4 list3.col1
#1 A 2 3 4 5 6 7 3
#2 B 3 4 5 5 4 4 7
# list3.col2 list3.col3 list4.col1 list4.col2 list4.col3
#1 6 7 3 6 7
#2 4 4 4 5 6
When run on your list with an added data frame, there are no warnings. There's also data.table: its merge method does not return a warning for duplicated column names.
library(data.table)
Reduce(merge.all, lapply(dlist, as.data.table))
Another option is to check the names as the data enters the function, change them there, and then you can avoid the warning. This isn't perfect but it works ok here.
merge.all <- function(x, y) {
  m <- match(names(y)[-1], gsub("[.](x|y)$", "", names(x)[-1]), 0L)
  names(y)[-1][m] <- paste0(names(y)[-1][m], "DUPE")
  merge(x, y, all=TRUE, by="Var")
}
# named res rather than rm, to avoid masking base::rm()
res <- Reduce(merge.all, dlist)
names(res)
# [1] "Var" "col1" "col2" "col3" "col1DUPE.x"
# [6] "col2DUPE.x" "col4" "col1DUPE.y" "col2DUPE.y" "col3DUPE.x"
# [11] "col1DUPE" "col2DUPE" "col3DUPE.y"
where dlist is
dlist <- structure(list(list1 = structure(list(Var = structure(1:2, .Label = c("A",
"B"), class = "factor"), col1 = 2:3, col2 = 3:4, col3 = 4:5), .Names = c("Var",
"col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-2L)), list2 = structure(list(Var = structure(1:2, .Label = c("A",
"B"), class = "factor"), col1 = c(5L, 5L), col2 = c(6L, 4L),
col4 = c(7L, 4L)), .Names = c("Var", "col1", "col2", "col4"
), class = "data.frame", row.names = c(NA, -2L)), list3 = structure(list(
Var = structure(1:2, .Label = c("A", "B"), class = "factor"),
col1 = c(3L, 7L), col2 = c(6L, 4L), col3 = c(7L, 4L)), .Names = c("Var",
"col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-2L)), list4 = structure(list(Var = structure(1:2, .Label = c("A",
"B"), class = "factor"), col1 = 3:4, col2 = c(6L, 5L), col3 = c(7L,
6L)), .Names = c("Var", "col1", "col2", "col3"), row.names = c(NA,
-2L), class = "data.frame")), .Names = c("list1", "list2", "list3",
"list4"))
