Sum of hybrid data frames depending on multiple conditions in R - r

This is a more complex follow-up to my previous question. The answer there was to use a matrix, but that doesn't work with data frames having values of different modes.
I want to combine data frames of different sizes, with character and integer columns, and calculate their sum depending on multiple conditions.
Conditions
sums are only calculated for those rows that have a matching "Name"-value
sums are calculated for matching column names only
if a cell in df4 is not 0 and not NA, the sum should be df3 + df4
else the sum should be df1 + df2 + df3
Example
> df1 <- data.frame(Name=c("Joe","Ann","Lee","Dan"), "1"=c(0,1,5,2), "2"=c(3,1,0,0), "3"=c(2,0,2,2), "4"=c(2,1,3,4))
> df1
Name X1 X2 X3 X4
1 Joe 0 3 2 2
2 Ann 1 1 0 1
3 Lee 5 0 2 3
4 Dan 2 0 2 4
> df2 <- data.frame(Name=c("Joe","Ann","Ken"), "1"=c(3,4,1), "2"=c(2,3,0), "3"=c(2,4,3))
> df2
Name X1 X2 X3
1 Joe 3 2 2
2 Ann 4 3 4
3 Ken 1 0 3
> df3 <- data.frame(Name=c("Lee","Ben"), "1"=c(1,3), "2"=c(3,4), "3"=c(4,3))
> df3
Name X1 X2 X3
1 Lee 1 3 4
2 Ben 3 4 3
The condition depends on this frame:
> df4 <- data.frame(Name=c("Lee","Ann","Dan"), "1"=c(6,0,NA), "2"=c(0,0,4), "3"=c(0,NA,0))
> df4
Name X1 X2 X3
1 Lee 6 0 0
2 Ann 0 0 NA
3 Dan NA 4 0
With the above examples, this is the expected result (* values depend on df4):
> dfsum
Name X1 X2 X3 X4
1 Joe 3 5 4 2
2 Ann 5 4 4 1
3 Lee 7* 3 6 3
4 Dan 2 4* 2 4
5 Ken 1 0 3 NA
6 Ben 3 4 3 NA
Possible steps?
First expand df1, df2, df3, df4 to 5 columns and 6 rows, fill missing data with NA.
Then for each data frame:
sort rows by "Name"
separate "Name" column from "X1"..."X4"
transform "X1"..."X4" columns to matrix
calculate sums of the matrices like in the answer to my other question but with the additional condition 1
transform result matrix to data frame
cbind the "Name" column with the result data frame
How can this be done in R?
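The steps above can be sketched in base R roughly as follows. This is only a minimal illustration, not the accepted approach: the helper names to_matrix and na_sum are hypothetical, and the columns are written as X1..X4 directly (the names R generates from "1".."4" in the example).

```r
# Example frames from the question, with column names written out as X1..X4
df1 <- data.frame(Name = c("Joe", "Ann", "Lee", "Dan"),
                  X1 = c(0, 1, 5, 2), X2 = c(3, 1, 0, 0),
                  X3 = c(2, 0, 2, 2), X4 = c(2, 1, 3, 4))
df2 <- data.frame(Name = c("Joe", "Ann", "Ken"),
                  X1 = c(3, 4, 1), X2 = c(2, 3, 0), X3 = c(2, 4, 3))
df3 <- data.frame(Name = c("Lee", "Ben"),
                  X1 = c(1, 3), X2 = c(3, 4), X3 = c(4, 3))
df4 <- data.frame(Name = c("Lee", "Ann", "Dan"),
                  X1 = c(6, 0, NA), X2 = c(0, 0, 4), X3 = c(0, NA, 0))

frames    <- list(df1, df2, df3, df4)
all_names <- unique(unlist(lapply(frames, function(d) as.character(d$Name))))
all_cols  <- unique(unlist(lapply(frames, function(d) setdiff(names(d), "Name"))))

# Hypothetical helper: expand a frame to the full Name x column grid, NA-padded
to_matrix <- function(d) {
  m <- matrix(NA_real_, nrow = length(all_names), ncol = length(all_cols),
              dimnames = list(all_names, all_cols))
  value_cols <- setdiff(names(d), "Name")
  m[as.character(d$Name), value_cols] <- as.matrix(d[value_cols])
  m
}

# Hypothetical helper: element-wise sum that ignores NA,
# but stays NA where every input is NA
na_sum <- function(...) {
  ms     <- list(...)
  zeroed <- lapply(ms, function(m) { m[is.na(m)] <- 0; m })
  total  <- Reduce(`+`, zeroed)
  total[Reduce(`&`, lapply(ms, is.na))] <- NA
  total
}

m1 <- to_matrix(df1); m2 <- to_matrix(df2)
m3 <- to_matrix(df3); m4 <- to_matrix(df4)

cond   <- !is.na(m4) & m4 != 0      # cells where df4 decides the result
result <- na_sum(m1, m2, m3)        # default: df1 + df2 + df3
result[cond] <- na_sum(m3, m4)[cond] # override: df3 + df4

dfsum <- data.frame(Name = all_names, result, row.names = NULL)
dfsum
```

Working on a shared Name-by-column grid avoids sorting the frames individually, because rows and columns are addressed by name rather than by position.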
Solution
@Ricardo Saporta's solution works with a few small changes:
Add , padValue=NA to each of the four addCols() calls.
As answered here, replace the definitions of sumD3D4 and dtsum with:
plus <- function(x) {
if(all(is.na(x))){
c(x[0],NA)} else {
sum(x,na.rm = TRUE)}
}
sumD3D4 <- setkey(rbind(dt3, dt4)[,lapply(.SD, plus), by = Name], "Name")
dtsum <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, plus), by=Name], "Name")
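A small check of why plus is needed instead of plain sum once the padding value is NA: sum with na.rm = TRUE turns an all-NA group into 0, whereas plus keeps it NA. The inputs below are made up for illustration.

```r
plus <- function(x) {
  if (all(is.na(x))) {
    c(x[0], NA)            # NA of the same type as x
  } else {
    sum(x, na.rm = TRUE)
  }
}

all_missing <- c(NA_real_, NA_real_)
partly_missing <- c(1, NA, 2)

sum(all_missing, na.rm = TRUE)  # 0, which would hide the missing cell
plus(all_missing)               # NA
plus(partly_missing)            # 3
```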

If you use data.table instead of data.frame, you could use its by=xxxx feature, to add by name.
The code below should give you your expected results.
Please note that I am padding the data.tables with extra empty columns. However, condTrue is computed before the padding takes place.
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt3 <- data.table(df3)
dt4 <- data.table(df4)
# make sure all dt's have the same columns
#-----------------------------------------#
# identify which dt4 satisfy the condition
condTrue <- as.data.table(which(!(is.na(dt4) | dt4==0), arr.ind=TRUE))
# ignore column "Name" from dt4
condTrue <- condTrue[col>1]
# convert from (row, col) index to ("Name", columnName)
condTrue <- data.table(Name=dt4[condTrue$row, Name], colm=names(dt4)[condTrue$col], key="Name")
# First make a list of all the unique column names
allColumnNames <- unique(c(names(dt1), names(dt2), names(dt3), names(dt4)))
# add columns as necessary, using addCols (defined below)
addCols(dt1, allColumnNames)
addCols(dt2, allColumnNames)
addCols(dt3, allColumnNames)
addCols(dt4, allColumnNames)
sumD3D4 <- setkey(rbind(dt3, dt4)[, lapply(.SD, sum), by=Name], "Name")
dtsum <- setkey(rbind(dt1, dt2, dt3)[, lapply(.SD, sum), by=Name], "Name")
for (Nam in condTrue$Name) {
colsRepl <- condTrue[.(Nam)]$colm
valsRepl <- unlist(sumD3D4[.(Nam), c(colsRepl), with=FALSE])
dtsum[.(Nam), c(colsRepl) := as.list(valsRepl)]
}
dtsum
# Name 1 2 3 4
# 1: Ann 5 4 4 1
# 2: Ben 3 4 3 0
# 3: Dan 2 4 2 4
# 4: Joe 3 5 4 2
# 5: Ken 1 0 3 0
# 6: Lee 7 3 6 3
addCols <- function(x, cols, padValue=0) {
# adds to x any columns that are in cols but not in x
# Returns TRUE if columns were added
# FALSE if no columns added
colsMissing <- setdiff(cols, names(x))
# grab the actual DT name that was passed to function
dtName <- as.character(match.call()[2])
if (length(colsMissing)) {
get(dtName, envir=parent.frame(1))[, c(colsMissing) := padValue]
return(TRUE)
}
return(FALSE)
}
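The by=Name grouping that drives the solution above can be seen in isolation on two tiny tables. This is a sketch with made-up tables a and b, assuming the data.table package is installed.

```r
library(data.table)

a <- data.table(Name = c("Joe", "Ann"), X1 = c(0, 1))
b <- data.table(Name = c("Joe", "Ken"), X1 = c(3, 1))

# Stack the tables, then sum every non-grouping column within each Name
tot <- rbind(a, b)[, lapply(.SD, sum), by = Name]
tot
```

Names that occur in only one table simply form a group of one row, so no explicit matching step is required.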

Related

Conditional merging based on full join

I would like to conditionally merge two datasets such that the values in dataframe2 replace the values in dataframe1, unless dataframe2 contains missing values. This should be performed in the case of a full join such that rows from both dataframe are preserved.
This question is inspired from Conditional merge/replacement in R (which seems to work only for inner join).
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:5,x2=c("zz","qq", NA, "qy"),stringsAsFactors=FALSE)
I would like the following result:
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
5 5 qy
I tried the following code, but it returns NA in the 4th row. I would like the original value to be preserved there, since df2 contains a missing value for x1 = 4.
df3 <- anti_join(df1, df2, by = "x1")
rbind(df3, df2)
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 <NA>
5 5 qy
It can be done with dplyr.
library(dplyr)
full_join(df1,df2,by = c("x1" = "x1")) %>%
transmute(x1 = x1,x2 = coalesce(x2.y,x2.x))
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
5 5 qy
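For readers without dplyr, the same idea (full join, then prefer the df2 value unless it is NA) might be sketched in base R like this. The suffixes .old and .new and the name res are chosen here for illustration only.

```r
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:5, x2 = c("zz", "qq", NA, "qy"), stringsAsFactors = FALSE)

# all = TRUE keeps rows from both sides (full join)
m <- merge(df1, df2, by = "x1", all = TRUE, suffixes = c(".old", ".new"))

# take the df2 value where present, otherwise fall back to df1
res <- data.frame(x1 = m$x1,
                  x2 = ifelse(is.na(m$x2.new), m$x2.old, m$x2.new),
                  stringsAsFactors = FALSE)
res
```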

merging more than 2 data frames whilst assigning an identifier factor in R

Take this very simple example. I want to know which package can be used to automatically assign a factor (preferably the data frame name) when merging two or more data.frames.
I have manually defined the factor in the example below and shown the desired output, but I want to automate it as I have over 100 tables to merge. Note that the headers within each df are constant; only the name itself changes.
A <- 1:5
B <- 5:1
df1 <- data.frame(A,B)
A <- 2:6
B <- 6:2
df2 <- data.frame(A,B)
df1$ID <- rep("df1", 5)
df2$ID <- rep("df2", 5)
big_df <- rbind(df1,df2)
Assuming that your data.frame names follow a certain pattern like beginning with "df" followed by numbers and they are not inside a list but simply in your global environment, you can use the following:
library(data.table)
bigdf <- rbindlist(Filter(is.data.frame, mget(ls(pattern = "^df\\d+"))), idcol = "ID")
Without data.table, you could do it as follows:
lst <- Filter(is.data.frame, mget(ls(pattern = "^df\\d+")))
bigdf <- do.call(rbind, Map(function(df, id) transform(df, ID=id), lst, names(lst)))
Consider the following:
library(dplyr)
cof_df <- bind_rows(df1, df2, .id="ID")
cof_df
ID A B
1 1 1 5
2 1 2 4
3 1 3 3
4 1 4 2
5 1 5 1
6 2 2 6
7 2 3 5
8 2 4 4
9 2 5 3
10 2 6 2
And then:
cof_df$ID <- factor(cof_df$ID,
levels = c(1,2),
labels = paste0("df", unique(cof_df$ID)))
does the recoding.
A similar result can be obtained by naming the arguments in bind_rows, as in
cof_df <- bind_rows(df1=df1, df2=df2, .id="ID")
Another solution is to use merge, provided the ID column has already been added to each data frame as in the question:
merged <- merge(df1, df2, all=TRUE, sort =FALSE)
> merged
A B ID
1 1 5 df1
2 2 4 df1
3 3 3 df1
4 4 2 df1
5 5 1 df1
6 2 6 df2
7 3 5 df2
8 4 4 df2
9 5 3 df2
10 6 2 df2

Why are there differences in using merge and %in%?

I have two datasets that I'd like to merge via two identifying variables (up and ver_u):
df1 looks like this:
up ver_u
257001 1
1010 1
101010 1
100316 1
df2 looks like this:
up ver_u code_uc quantity
500116 1 395884 1
100116 1 36761 2
160116 1 81308 3
100116 1 76146 1
113216 1 6338 1
101116 1 33887 1
What I would like to do is extract the subset of df2 whose up and ver_u values match those in df1. I did this in two different ways and got different answers.
First method:
pur <- merge(df2, df1,by=c("up","ver_u"))
Second method:
test <- df2[(df2$up %in% df1$up) & (df2$ver_u %in% df1$ver_u),]
They give me different numbers of observations, and I don't see why.
When I used merge on the data frame test with the following code, I got the same number of observations, but the two resulting data frames are still different.
pur1 = merge(test, df1,by=c("up","ver_u"))
Is there a systematic difference between using merge and %in%?
Would greatly appreciate any insight on this.
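Because merge matches both columns jointly, row by row: a row is kept only if its (up, ver_u) pair occurs in the other data frame. %in% tests each column independently against all values of that column in the other data frame, so a row passes even when its two values come from different rows there. Example: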
#dummy data
df1 <- data.frame(x = c(1,2,3),
y = c(2,3,4))
df1
# x y
# 1 1 2
# 2 2 3
# 3 3 4
df2 <- data.frame(x = c(2,3,1,3),
y = c(3,1,4,1))
df2
# x y
# 1 2 3
# 2 3 1
# 3 1 4
# 4 3 1
# using merge
merge(df1, df2, by = c("x", "y"))
# x y
# 1 2 3
# using %in%
df1[(df1$x %in% df2$x) & (df1$y %in% df2$y), ]
# x y
# 2 2 3
# 3 3 4
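If a %in%-style filter is preferred, one way to make it behave like merge is to test a combined key instead of each column separately. The sketch below reuses the dummy data from this answer; the names key1, key2 and pairwise are made up, and pasting with a space assumes the separator cannot occur inside the values.

```r
df1 <- data.frame(x = c(1, 2, 3), y = c(2, 3, 4))
df2 <- data.frame(x = c(2, 3, 1, 3), y = c(3, 1, 4, 1))

# one key per row, built from both columns
key1 <- paste(df1$x, df1$y)
key2 <- paste(df2$x, df2$y)

# keeps only rows of df1 whose (x, y) pair occurs in df2
pairwise <- df1[key1 %in% key2, ]
pairwise
```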

Reshape data frame in R: rows to columns

There are 3 columns in the original data frame: id, type and rank. Now I want to create a new data frame with each possible value of type as its own column (see the small example below; the original data contains more than 100,000 rows and 30 types).
data1
id type rank
x a 1
y a 2
z a 3
x b 1
z b 2
y c 1
data2
id a b c
x 1 1 NA
y 2 NA 1
z 3 2 NA
That's what I have done so far:
for (i in (1:nrow(data1))) {
dtype <- as.character(data1[i,2])
if (any(data2$id == data1[i,1], na.rm = TRUE)) {
row <- grep(data1[i,1],data2$id)
data2[row,c(dtype)] <- data1[i,3]
} else {
data2[nrow(data2)+1,1] <- as.character(data1[i,1])
data2[nrow(data2),c(dtype)] <- data1[i,3]
}
}
This works (I hope this example explains what I am doing), but it is quite slow. Do you have any hints how I can optimize this algorithm?
The function you need is named in your question's title: reshape from base R (mydf here is the data1 from the question):
> reshape(mydf, direction = "wide", idvar = "id", timevar = "type")
id rank.a rank.b rank.c
1 x 1 1 NA
2 y 2 NA 1
3 z 3 2 NA
Here's an example from the tidyr package.
library("tidyr")
library("dplyr")
data2 <- data1 %>% spread(type, rank)
id a b c
1 x 1 1 NA
2 y 2 NA 1
3 z 3 2 NA
Here's using data.table:
require(data.table)
ans = dcast(setDT(data1), id ~ type, value.var = "rank")
ans
# id a b c
# 1: x 1 1 NA
# 2: y 2 NA 1
# 3: z 3 2 NA
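As a further base R possibility, tapply can build the id-by-type grid directly when each (id, type) pair occurs at most once, which is an assumption made here. The result is a matrix rather than a data frame, and the name wide is made up.

```r
data1 <- data.frame(id   = c("x", "y", "z", "x", "z", "y"),
                    type = c("a", "a", "a", "b", "b", "c"),
                    rank = c(1, 2, 3, 1, 2, 1))

# one cell per (id, type); combinations that do not occur stay NA
wide <- with(data1, tapply(rank, list(id, type), sum))
wide
```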

Create new data frame depending on the most extreme value in rows

I have the following data frame and I would like to create a new one that will be like the one below.
ID1 ID2 ID3 ID4
x1_X 0 10 4 7
x2_X 2 12 5 8
x3_X 3 1 3 5
y1_Y 4 13 6 4
y2_Y 5 14 1 9
y3_Y 2 11 1 5
y4_Y 1 1 2 3
z1_Z 1 0 0 5
z2_Z 3 6 7 7
New data frame
ID1 ID2 ID3 ID4
X x3 x2 x2 x2
Y y2 y2 y1 y2
Z z2 z2 z2 z2
Basically the idea is the following:
For each ID I want to find which of the rownames (x1_X,x2_X,x3_X) has the most extreme value and assign this to name X since in the rownames I have subgroups.
My data frame is huge: 1700 columns and 100000 rows.
First we need to split the group and subgroup labels:
grp <- strsplit(row.names(df), "_")
And if performance is an issue, I think data.table is our best choice:
library(data.table)
df$group <- sapply(grp, "[", 2)
subgroup <- sapply(grp, "[", 1)
dt <- data.table(df)
And we now have access to the single line:
result <- dt[,lapply(.SD, function(x) subgroup[.I[which.max(x)]]), by=group]
This splits the data.table by the character after the underscore (by=group). For every column of each group's subset (.SD), which.max gives the position of the maximum within the group, .I maps that position back to a row number of the whole data.table, and subgroup supplies the label for that row.
The data.table package is meant to be quite efficient, though you might want to look into indexing your data.table if you're going to be querying it multiple times.
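The same "find the position of the maximum within the group, then map it back to the full table" idea can be sketched in base R. This is only an illustration on the example data, not a claim about performance on 1700 columns; the name res is made up.

```r
df <- read.table(text = "     ID1 ID2 ID3 ID4
x1_X   0  10   4   7
x2_X   2  12   5   8
x3_X   3   1   3   5
y1_Y   4  13   6   4
y2_Y   5  14   1   9
y3_Y   2  11   1   5
y4_Y   1   1   2   3
z1_Z   1   0   0   5
z2_Z   3   6   7   7", header = TRUE)

parts    <- strsplit(rownames(df), "_")
subgroup <- sapply(parts, `[`, 1)   # x1, x2, ...
group    <- sapply(parts, `[`, 2)   # X, Y, Z

# for each column: within each group, pick the row index holding the maximum
# and return that row's subgroup label
res <- sapply(df, function(col) {
  tapply(seq_along(col), group, function(i) subgroup[i[which.max(col[i])]])
})
res
```

As with which.max in the data.table version, ties are resolved by taking the first maximum in each group.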
Your table:
df <- read.table (text= " ID1 ID2 ID3 ID4
x1_X 0 10 4 7
x2_X 2 12 5 8
x3_X 3 1 3 5
y1_Y 4 13 6 4
y2_Y 5 14 1 9
y3_Y 2 11 1 5
y4_Y 1 1 2 3
z1_Z 1 0 0 5
z2_Z 3 6 7 7", header = T)
Split rownames to get groups:
library(plyr)
df_names <- ldply(strsplit (rownames(df), "_"))
colnames(df_names) <- c ("group1", "group2")
df2 <- cbind (df, df_names)
Create new table:
df_new <- data.frame (matrix(nrow = length(unique (df2$group2)),
ncol = ncol(df)))
colnames(df_new) <- colnames(df)
rownames (df_new) <- unique (df_names[["group2"]])
Filling new table with a loop:
for (i in 1:ncol (df_new)) {
for (k in 1:nrow (df_new)) {
col0 <- colnames (df_new)[i]
row0 <- rownames (df_new)[k]
sub0 <- df2 [df2$group2 == row0, c(col0, "group1")]
df_new [k,i] <- sub0 [sub0[1]==max (sub0[1]), 2]
}
}
