How to sum and combine two data frames? - r

I have two data frames:
DATA1:
ID com_alc_cd com_liv_cd com_hyee_cd
A 1 0 0
B 0 0 1
D 0 0 0
C 0 1 0
DATA2:
ID com_alc_dd com_liv_dd com_hyee_dd
B 0 2 0
A 1 0 2
C 0 1 0
D 0 1 0
I want to combine the two data frames, so as to obtain the sum of the two:
SUM(DATA1, DATA2):
ID com_alc com_liv com_hyee
A 2 0 2
B 0 2 1
C 0 2 0
D 0 1 0

Try this, for example (assuming both data frames have the same dimensions and contain the same set of IDs):
d1 <- DATA1[order(DATA1$ID),]
d2 <- DATA2[order(DATA2$ID),]
data.frame(ID=d1$ID,as.matrix(subset(d1,select=-ID)) +
as.matrix(subset(d2,select=-ID)))
ID com_alc_cd com_liv_cd com_hyee_cd
1 A 2 0 2
2 B 0 2 1
4 C 0 2 0
3 D 0 1 0
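Sorting both data frames works only because they contain exactly the same IDs. A slightly more defensive sketch checks that assumption and aligns DATA2 to DATA1 with match() instead. The data construction below simply retypes the question's example, and stripping the trailing suffix from the column names is an added convenience, not part of the original answer.

```r
# Example data retyped from the question
DATA1 <- data.frame(ID = c("A", "B", "D", "C"),
                    com_alc_cd  = c(1, 0, 0, 0),
                    com_liv_cd  = c(0, 0, 0, 1),
                    com_hyee_cd = c(0, 1, 0, 0))
DATA2 <- data.frame(ID = c("B", "A", "C", "D"),
                    com_alc_dd  = c(0, 1, 0, 0),
                    com_liv_dd  = c(2, 0, 1, 1),
                    com_hyee_dd = c(0, 2, 0, 0))

# Assumption: both data frames hold the same set of unique IDs
stopifnot(setequal(DATA1$ID, DATA2$ID),
          !anyDuplicated(DATA1$ID),
          !anyDuplicated(DATA2$ID))

# Row of DATA2 that corresponds to each row of DATA1
idx <- match(DATA1$ID, DATA2$ID)

# Element-wise sum of the value columns (matched by position, not by name)
m <- as.matrix(DATA1[-1]) + as.matrix(DATA2[idx, -1])
colnames(m) <- sub("_[^_]+$", "", colnames(m))  # drop the _cd suffix

res <- data.frame(ID = DATA1$ID, m)
res <- res[order(res$ID), ]
res
```

Because the columns are added by position, this still assumes the value columns appear in the same order in both data frames.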
EDIT: a more general solution
library(reshape2)
## put the data in the long format
res <- do.call(rbind,lapply(list(DATA1,DATA2),melt,id.vars='ID'))
## polish names
res$variable <- gsub('(.*_.*)_.*','\\1',res$variable)
## wide format and aggregate using sum
dcast(ID~variable,data=res,fun.aggregate=sum)
ID com_alc com_hyee com_liv
1 A 2 2 0
2 B 0 1 2
3 C 0 0 2
4 D 0 0 1

You can also use aggregate (here df1 and df2 stand for DATA1 and DATA2):
names(df1) <- names(df2)
df3 <- rbind(df1, df2)
res <- aggregate(df3[,-1], by=list(df3$ID), sum)
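For a version that runs on its own, the following sketch rebuilds the question's data under the names df1 and df2 and uses the formula interface of aggregate(). Renaming the columns by stripping the `_cd` / `_dd` suffix is an assumption made here so that the result carries the column names the question asked for.

```r
# Example data retyped from the question
df1 <- data.frame(ID = c("A", "B", "D", "C"),
                  com_alc_cd  = c(1, 0, 0, 0),
                  com_liv_cd  = c(0, 0, 0, 1),
                  com_hyee_cd = c(0, 1, 0, 0))
df2 <- data.frame(ID = c("B", "A", "C", "D"),
                  com_alc_dd  = c(0, 1, 0, 0),
                  com_liv_dd  = c(2, 0, 1, 1),
                  com_hyee_dd = c(0, 2, 0, 0))

# Strip the trailing suffix so both data frames share column names
names(df1) <- sub("_(cd|dd)$", "", names(df1))
names(df2) <- sub("_(cd|dd)$", "", names(df2))

# Stack the rows, then sum every remaining column within each ID
df3 <- rbind(df1, df2)
res <- aggregate(. ~ ID, data = df3, FUN = sum)
res
```

Unlike the matrix addition, this does not require the two data frames to have the same IDs or the same number of rows; an ID present in only one of them keeps its own values.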

Related

A Pivot table in r with binary output [duplicate]

I have the following dataset
#dataset
id attributes value
1 a,b,c 1
2 c,d 0
3 b,e 1
I wish to make a pivot table out of it, with one binary column per attribute (1 if the row has the attribute, 0 otherwise). My ideal output would be the following:
#output
id a b c d e Value
1 1 1 1 0 0 1
2 0 0 1 1 0 0
3 0 1 0 0 1 1
Any tip is really appreciated.
We split the 'attributes' column by ',', get the frequency with mtabulate from qdapTools and cbind with the first and third column.
library(qdapTools)
cbind(df1[1], mtabulate(strsplit(df1$attributes, ",")), df1[3])
# id a b c d e value
#1 1 1 1 1 0 0 1
#2 2 0 0 1 1 0 0
#3 3 0 1 0 0 1 1
With base R:
attributes <- sort(unique(unlist(strsplit(as.character(df$attributes), split=','))))
cols <- as.data.frame(matrix(rep(0, nrow(df)*length(attributes)), ncol=length(attributes)))
names(cols) <- attributes
df <- cbind.data.frame(df, cols)
df <- as.data.frame(t(apply(df, 1, function(x){attributes <- strsplit(x['attributes'], split=','); x[unlist(attributes)] <- 1;x})))[c('id', attributes, 'value')]
df
id a b c d e value
1 1 1 1 1 0 0 1
2 2 0 0 1 1 0 0
3 3 0 1 0 0 1 1
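A shorter base R variant of the same idea is sketched below: split each attribute string, then test membership of every known attribute with %in%. The data frame retypes the question's example; the helper names `parts`, `lev` and `ind` are made up for illustration.

```r
df <- data.frame(id = 1:3,
                 attributes = c("a,b,c", "c,d", "b,e"),
                 value = c(1, 0, 1),
                 stringsAsFactors = FALSE)

# One character vector of attributes per row
parts <- strsplit(df$attributes, ",", fixed = TRUE)

# All distinct attributes, sorted: these become the new columns
lev <- sort(unique(unlist(parts)))

# vapply returns one column per row of df, so transpose to rows x attributes
ind <- t(vapply(parts, function(p) as.integer(lev %in% p), integer(length(lev))))
colnames(ind) <- lev

out <- cbind(df["id"], ind, df["value"])
out
```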

find how many times and which columns a string is repeated

Each element of my data consists of 6 strings, and each string has 6 characters. The data contains white space too.
I want to know how many times each string is repeated across all columns.
For example, P67809 is repeated 2 times, in column a and column d,
so the output should look like
string No columns
P67809 2 a,d
Based on this function I can assign a row number to each string
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
Then I apply the function to every column, like this:
myS <- lapply(mydata, normalize,";")
but I don't know how to then search the result and get the output.
We could melt the data from 'wide' to 'long' format. Split the 'value' column with ; to get a list output. We set the names of the list as the 'variable' column of 'dM'. Then stack the list to a two column output, and get the frequency count with 'tbl'. It may be easier to understand the result from the 'tbl' output.
library(reshape2)
dM <- melt(mydata, id.var=NULL)
lst1 <- setNames(strsplit(dM$value, ";"), dM$variable)
tbl <- table(stack(lst1)[2:1])
tbl
values
#ind A4QPH2 O60814 P0CG47 P0CG48 P14923 P15924 P19338 P35908 P42356 P57053 P58876 P62750 P62807 P62851 P62979 P63241 P67809 Q02413 Q06830 Q07955 Q16658 Q5QNW6 Q6IS14 Q8N8J0 Q93079 Q969S3
# a 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
# b 3 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0
# c 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0 0 0
# d 0 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1
# values
#ind Q99877 Q99879 Q9Y2T7
# a 0 0 1
# b 0 0 0
# c 0 0 0
# d 1 1 1
We get the total number of each element with colSums.
cS <- colSums(tbl)
If we need the output in the format shown in the OP's post, we can melt the list output to create a two-column data.frame ('value' and 'L1', where 'L1' holds the list names, i.e. the original column). We then convert it to a 'data.table' with setDT(), group by the 'value' column, count the unique elements of 'L1', and paste those unique elements together.
library(data.table)
res <- setDT(melt(lst1))[, list(No= uniqueN(L1),
columns= toString(unique(L1))) ,.(string=value)]
head(res,2)
# string No columns
#1: P67809 2 a, d
#2: Q9Y2T7 2 a, d
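The same summary can be sketched in base R without reshape2 or data.table. The question does not show `mydata`, so the small data frame below is a made-up stand-in that only mimics its shape (columns of ";"-separated IDs); the helper names are hypothetical as well.

```r
# Hypothetical stand-in for mydata: each cell holds IDs separated by ";"
mydata <- data.frame(a = c("P67809;Q9Y2T7", "P42356"),
                     b = c("Q02413", "P63241;P62979"),
                     d = c("P67809", "Q9Y2T7;P62979"),
                     stringsAsFactors = FALSE)

# Long format: one row per (string, column) pair
long <- stack(lapply(mydata, function(col) unlist(strsplit(col, ";", fixed = TRUE))))
long$ind <- as.character(long$ind)
long <- unique(long)  # count a column at most once per string

# For each string, the columns in which it occurs
cols_by_string <- split(long$ind, long$values)

res <- data.frame(string  = names(cols_by_string),
                  No      = lengths(cols_by_string),
                  columns = vapply(cols_by_string, paste, character(1), collapse = ","),
                  row.names = NULL,
                  stringsAsFactors = FALSE)
res
```

Here `No` counts the number of distinct columns a string appears in; dropping the unique() step would count every occurrence instead.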
One approach might be:
res <- apply(mydata, 2, function(x) unlist(strsplit(x, ";")))
un <- unique(unlist(res))
res2 <- sapply(un, function(x) lapply(res, function(y) as.numeric(x %in% y)))
res2
P67809 Q9Y2T7 P42356 Q8N8J0 A4QPH2 P35908 P19338 P15924 P14923 Q02413 P63241 Q6IS14
a 1 1 1 1 1 1 1 1 1 0 0 0
b 0 0 0 0 0 0 0 0 0 1 1 1
c 0 0 0 0 0 0 0 0 0 1 1 1
d 1 1 0 0 0 0 0 0 0 0 0 0
P62979 P0CG47 P0CG48 Q16658 P62851 Q07955 Q06830 P62807 O60814 P57053 Q99879 Q99877
a 0 0 0 0 0 0 0 0 0 0 0 0 0
b 1 1 1 1 0 0 0 0 0 0 0 0 0
c 1 1 1 1 1 1 0 0 0 0 0 0 0
d 1 1 1 0 0 0 1 1 1 1 1 1 1
Q93079 Q5QNW6 P58876 P62750 Q969S3
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 1 1 1 1 1
as.data.frame(t(apply(t(res2), 1, function(x) cbind(sum(as.numeric(x)), paste(names(x)[which(as.logical(x))], collapse = ",")))))
V1 V2
P67809 2 a,d
Q9Y2T7 2 a,d
P42356 1 a
Q8N8J0 1 a
A4QPH2 1 a
P35908 1 a
P19338 1 a
P15924 1 a
P14923 1 a
Q02413 2 b,c
P63241 2 b,c
Q6IS14 2 b,c
P62979 3 b,c,d
P0CG47 3 b,c,d
P0CG48 3 b,c,d
2 b,c
Q16658 1 c
P62851 1 c
Q07955 1 d
Q06830 1 d
P62807 1 d
O60814 1 d
P57053 1 d
Q99879 1 d
Q99877 1 d
Q93079 1 d
Q5QNW6 1 d
P58876 1 d
P62750 1 d
Q969S3 1 d
An alternative approach with cSplit from splitstackshape and gather from tidyr.
library(splitstackshape)
library(tidyr)
library(dplyr)
splitted <- cSplit(mydata, splitCols = names(mydata), sep = ";") %>% gather() # Split cols and melt data
splitted$key <- substring(splitted$key, 1, 1) # Lose irrelevant string
table(splitted) # Generate frequency table

Split data frame into chunk and assign names to chunks from vectors

I have a data frame with 5*n columns, where n is the number of categories listed in a vector. I want to break the data frame into chunks of 5 columns (eg. category 1 is columns 1:5, category 2 is columns 6:10) and then assign the category names from the vector to the chunks.
eg.
*original data frame*
X a b c d e a b c d e a b c d e
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1
*vector of category names*
1 apples
2 oranges
3 bananas
Will become
*apples*       *oranges*      *bananas*
X a b c d e    X a b c d e    X a b c d e
1 1 0 0 0 1    1 0 1 0 1 0    1 0 0 1 1 0
2 0 1 0 1 0    2 0 0 1 0 1    2 1 0 0 0 1
I can find a whole lot of information about splitting data.frames by rows, which is much more common to do, but I can't find anything about splitting a data frame into n chunks by columns. Thanks!
You could split your original data frame by column indices in a similar way:
df <- read.table(header=T, check.names = F, text="
X a b c d e a b c d e a b c d e
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1")
n <- 5 # fixed chunksize (a-e)
lst <- lapply(split(2:ncol(df), rep(seq(ncol(df[-1])/n), each=n)), function(x) df[, x])
names(lst) <- c("apples", "oranges", "bananas")
# lst
# $apples
# a b c d e
# 1 1 0 0 0 1
# 2 0 1 0 1 0
#
# $oranges
# a b c d e
# 1 0 1 0 1 0
# 2 0 0 1 0 1
#
# $bananas
# a b c d e
# 1 0 0 1 1 0
# 2 1 0 0 0 1
I don't know if this is elegant, but it was the first thing that came to mind.
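The question's desired output keeps the X column in every chunk and takes the names from a vector. A possible variation on the answer above is sketched here; the vector name `categories` is an assumption, and the grouping factor is given explicit levels so the list keeps the vector's order rather than alphabetical order.

```r
df <- read.table(header = TRUE, check.names = FALSE, text = "
X a b c d e a b c d e a b c d e
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1")

categories <- c("apples", "oranges", "bananas")  # assumed name for the vector
n <- 5                                           # columns per category

# The layout must be exactly: X followed by n columns per category
stopifnot(ncol(df) - 1 == n * length(categories))

# Assign column indices 2..ncol(df) to categories in consecutive blocks of n
grp <- factor(rep(categories, each = n), levels = categories)
groups <- split(seq_len(ncol(df))[-1], grp)

# Each chunk is the X column plus that category's n columns
lst <- lapply(groups, function(idx) df[, c(1, idx)])
lst$oranges
```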

Vectorizing a for-loop that merges two data frames by column

Suppose I have two dataframes df1 and df2.
df1 <- data.frame(matrix(c(0,0,1,0,0,1,1,1,0,1),ncol=10,nrow=1))
colnames(df1) <- LETTERS[seq(1,10)]
df2 <- data.frame(matrix(c(1,1,1,1),ncol=4,nrow=1))
colnames(df2) <- c("C","D","A","I")
Some of the column names in df2 match column names in df1 and df1 always contains every possible column name that can occur in df2. I want to append df1 with a new row which holds the value of df2 for matching columns and a 0 for non-matching columns. My current approach uses a for-loop:
for(i in 1:ncol(df1)){
if(colnames(df1)[i] %in% colnames(df2)){
df1[2,i] <- df2[1,which(colnames(df2)==colnames(df1)[i])]
} else {
df1[2,i] <- 0
}
}
Well, it works. But I wonder if there is a cleaner (and faster) solution for this task, perhaps taking advantage of vectorized operations.
res <-merge(df1,df2,all=T)[,colnames(df1)]
res[is.na(res)] <- 0
res
# A B C D E F G H I J
# 1 0 0 1 0 0 1 1 1 0 1
# 2 1 0 1 1 0 0 0 0 1 0
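Possibly more efficient would be bind_rows from "dplyr" (it replaces the older rbind_all and rbind_list, which are no longer available in current dplyr releases):
library(dplyr)
bind_rows(df1, df2)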
# A B C D E F G H I J
# 1 0 0 1 0 0 1 1 1 0 1
# 2 1 NA 1 1 NA NA NA NA 1 NA
Assign the result to "res" and replace NA with 0 in the same way as in the merge() answer above (res[is.na(res)] <- 0).
Just using assignment:
df1[2,] <- 0
df1[2,names(df2)] <- df2
# A B C D E F G H I J
#1 0 0 1 0 0 1 1 1 0 1
#2 1 0 1 1 0 0 0 0 1 0
...and just to prove it works with other values:
df2$C <- 8
df1[2,] <- 0
df1[2,names(df2)] <- df2
# A B C D E F G H I J
#1 0 0 1 0 0 1 1 1 0 1
#2 1 0 8 1 0 0 0 0 1 0
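If df1 should stay unmodified, the same name-based assignment can be done on a separate vector first and appended afterwards. This is only a sketch of that variation; `new_row` and `res` are names chosen here, and it relies on the question's guarantee that every column of df2 exists in df1.

```r
df1 <- data.frame(matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 0, 1), ncol = 10, nrow = 1))
colnames(df1) <- LETTERS[seq(1, 10)]
df2 <- data.frame(matrix(c(1, 1, 1, 1), ncol = 4, nrow = 1))
colnames(df2) <- c("C", "D", "A", "I")

# Assumption from the question: df1 contains every column name found in df2
stopifnot(all(names(df2) %in% names(df1)))

# Start with a named vector of zeros, one entry per column of df1
new_row <- setNames(numeric(ncol(df1)), names(df1))

# Fill in the matching columns by name; non-matching columns stay 0
new_row[names(df2)] <- unlist(df2[1, ])

# Append as a one-row data frame so rbind matches columns by name
res <- rbind(df1, as.data.frame(as.list(new_row)))
res
```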

Multiplying and combine two data frames in R

I have two data frames
data frame 1
A B C
1 1 0
0 0 0
1 1 0
data frame 2
1 2
0 4
100 0
100 4
I need to multiply and combine the columns to obtain
A1 A2 B1 B2 C1 C2
0 4 0 4 0 0
0 0 0 0 0 0
100 4 100 4 0 0
Here's one approach:
do.call(cbind, lapply(df1, "*", as.matrix(df2)))
1 2 1 2 1 2
[1,] 0 4 0 4 0 0
[2,] 0 0 0 0 0 0
[3,] 100 4 100 4 0 0
This returns a matrix. You can use as.data.frame to turn it into a data frame if it's necessary.
This is based on the following data:
df1 <- data.frame(A = c(1,0,1), B = c(1,0,1), C = 0)
df2 <- data.frame("1" = c(0,100,100), "2" = c(4,0,4),
check.names = FALSE)
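To get the column names from the question (A1, A2, B1, ...) and a data frame rather than a matrix, the names can be built by pairing each df1 name with every df2 name. A minimal sketch, with `m` and `res` as names chosen here:

```r
df1 <- data.frame(A = c(1, 0, 1), B = c(1, 0, 1), C = 0)
df2 <- data.frame("1" = c(0, 100, 100), "2" = c(4, 0, 4),
                  check.names = FALSE)

# Multiply every column of df1 with all of df2, then bind the pieces side by side
m <- do.call(cbind, lapply(df1, "*", as.matrix(df2)))

# Each df1 name repeated once per df2 column, pasted onto the df2 names
colnames(m) <- paste0(rep(names(df1), each = ncol(df2)), names(df2))

res <- as.data.frame(m)
res
```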
