Union dataframes in some way that updates rows with same row.name

I want to do a union of two dataframes that share some rows with the same row name. For rows with common row names, I would like to keep the values from the second dataframe, not the first one's. For example:
df1 <- data.frame(col1 = c(1,2), col2 = c(2,4), row.names = c("row_1", "row_2"))
df1
# col1 col2
# row_1 1 2
# row_2 2 4
df2 <- data.frame(col1 = c(3,10), col2 = c(6,99), row.names = c("row_3", "row_2"))
df2
# col1 col2
# row_3 3 6
# row_2 10 99
The result I would like to obtain would then be :
someSpecificRBind(df1,df2, takeIntoAccount=df2)
# col1 col2
# row_1 1 2
# row_2 10 99
# row_3 3 6
The function rbind doesn't do the job: it keeps both rows and renames the duplicated row names to make them unique.

I would conceptualize this as only adding to df2 the rows in df1 that aren't already there:
rbind(df2, df1[setdiff(rownames(df1), rownames(df2)), ])
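As a minimal sketch, the same idea can be wrapped in a small helper. The name rbind_prefer and its argument names are hypothetical, and the df2 values are chosen to match the printed output in the question:

```r
# Hypothetical helper: rows of `priority` win over rows of `other`
# whenever both share a row name.
rbind_prefer <- function(other, priority) {
  extra <- setdiff(rownames(other), rownames(priority))
  rbind(priority, other[extra, , drop = FALSE])
}

df1 <- data.frame(col1 = c(1, 2), col2 = c(2, 4),
                  row.names = c("row_1", "row_2"))
df2 <- data.frame(col1 = c(3, 10), col2 = c(6, 99),
                  row.names = c("row_3", "row_2"))

res <- rbind_prefer(df1, df2)
res[order(rownames(res)), ]  # sort by row name to match the desired layout
#       col1 col2
# row_1    1    2
# row_2   10   99
# row_3    3    6
```

The drop = FALSE guard only matters when the data frames have a single column, where plain indexing would otherwise return a vector.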

We get the index of duplicated elements and use that to filter
rbind(df2, df1)[!duplicated(c(row.names(df2), row.names(df1))),]
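To see why this filter keeps the df2 version of a shared row, it may help to look at what duplicated() returns for the combined row names. This is only an illustration using the row names from the question:

```r
# Row names in the order rbind(df2, df1) stacks them
nms <- c("row_3", "row_2", "row_1", "row_2")

# duplicated() flags every occurrence after the first one
dup <- duplicated(nms)
dup
# [1] FALSE FALSE FALSE  TRUE
```

Because df2 is stacked first, its row_2 is the first occurrence and survives; the row_2 coming from df1 is the one flagged and dropped.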

Related

Collapse redundant rows in data table

I have a data table in the format:
myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)
print(myTable)
Col1 Col2
1: A 1
2: A 2
3: A 3
4: B 4
5: B 5
6: B 6
I want to show only the highest result for each category in Col1, then collapse all others and present their sum in Col2. It should look like this:
print(myTable)
Col1 Col2
1: A 3
2: Others 3
3: B 6
4: Others 9
I managed to do it with the following code:
unique <- unique(myTable$Col1) # unique values in Col1
myTable2 <- data.table() # empty data table to populate
for(each in unique){
temp <- myTable[Col1 == each, ] # filter myTable for unique Col1 values
temp <- temp[order(-Col2)] # order filtered table by decreasing Col2
sumCol2 <- sum(temp$Col2) # sum of values in filtered Col2
temp <- temp[1, ] # retain only first element
remSum <- sumCol2 - sum(temp$Col2) # remaining sum in Col2 (without first element)
temp <- rbindlist(list(temp, data.table("Others", remSum))) # rbind first element and remaining elements
myTable2 <- rbindlist(list(myTable2, temp)) # populate data table from beginning
}
This works, but I am trying to shorten a very large data table, so it takes forever.
Is there any better way to approach this?
Thanks.
UPDATE: Actually my procedure is a little bit more complicated. I figured I would be able to develop it myself after the basics were mastered, but it seems I will need further help instead. I want to display the 5 highest values for each category in Col1 and collapse the others, but some categories in Col1 do not have 5 values; in these cases, all entries should be displayed and no "Others" row should be added.

Here the data is split into groups according to the value of Col1 (by = Col1). .N is the number of rows in the current group, and therefore also the index of its last row, so c(Col2[.N], sum(Col2) - Col2[.N]) gives the last value of Col2 followed by the sum of the remaining values. This relies on Col2 being sorted in increasing order within each group, as it is in the example. The newly created columns are wrapped in .() because .() is an alias for list() in data.table, and the created columns need to go in a list. The table is called df in the code below.
library(data.table)
setDT(df)
df[, .(Col1 = c(Col1, 'Others'),
Col2 = c(Col2[.N], sum(Col2) - Col2[.N]))
, by = Col1][, -1]
# Col1 Col2
# 1: A 3
# 2: Others 3
# 3: B 6
# 4: Others 9
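For the UPDATE in the question (keep the top n values and add an "Others" row only when a group has more than n rows), one possible sketch is below. The helper name top_n_with_others and the Label column are hypothetical, and the function assumes the table has columns named Col1 and Col2 as in the example:

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep the n largest Col2
# values per Col1 group; collapse the rest into one "Others" row, but only
# when the group has more than n rows.
top_n_with_others <- function(dt, n) {
  dt[order(-Col2),
     if (.N > n) {
       list(Label = c(rep(Col1, n), "Others"),
            Col2  = c(Col2[seq_len(n)], sum(Col2[-seq_len(n)])))
     } else {
       list(Label = rep(Col1, .N), Col2 = Col2)
     },
     keyby = Col1]
}

myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)
res1 <- top_n_with_others(myTable, 1)  # Label/Col2: A 3, Others 3, B 6, Others 9
res5 <- top_n_with_others(myTable, 5)  # groups have only 3 rows: no "Others"
```

Sorting once with order(-Col2) and then grouping avoids the explicit loop and the repeated rbindlist calls, which is likely where most of the time went in the original code.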

If it is just a matter of displaying things, you could use the 'tables' package:
library(tables)
library(dplyr) # provides last() and %>%
others <- function(x) sum(x)-last(x)
df %>% tabular(Col1*(last+others) ~ Col2, .)
# Col1 Col2
# A last 3
# others 3
# B last 6
# others 9

do.call(
rbind, lapply(split(myTable, factor(myTable$Col1)), function(x) rbind(x[which.max(x$Col2),], list("Other", sum(x$Col2[-which.max(x$Col2)]))))
)
# Col1 Col2
#1: A 3
#2: Other 3
#3: B 6
#4: Other 9

I did it! I made a new myTable to illustrate. I want to retain only the 4 highest values by category, and collapse the others.
set.seed(123)
myTable <- data.table(Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)), Col2 = sample(1:12, 12))
print(myTable)
Col1 Col2
1: A 8
2: A 5
3: A 2
4: B 7
5: B 10
6: B 9
7: B 12
8: B 11
9: C 4
10: C 6
11: C 3
12: C 1
# set key to Col2, which sorts the table by Col2 in increasing order
setkey(myTable, Col2)
# if a Col1 category has more than 4 entries, keep all of them; otherwise return 4 entries, padded with NA
myTable <- myTable[,.(Col2 = Col2[1:max(c(4, .N))]) , by = Col1]
# will print in Col1: 4 entries of Col1 category, then "Other"
# will print in Col2: 4 last entries of Col2 in that category, then the remaining sum
myTable <- myTable[, .(Col1 = c(rep(Col1, 4), "Other"), Col2 = c(Col2[.N-3:0], sum(Col2) - sum(Col2[.N-3:0]))), by = Col1]
# removes rows with NA inserted in first step
myTable <- na.omit(myTable)
# removes rows where Col2 = 0, inserted because that Col1 category had exactly 4 entries
myTable <- myTable[Col2 != 0]
Owooooo!

Here's a base R solution and the dplyr equivalent:
res <- aggregate(Col2 ~.,transform(
myTable, Col0 = replace(Col1,duplicated(Col1,fromLast = TRUE), "Other")), sum)
res[order(res$Col1),-1]
# Col0 Col2
# 1 A 3
# 3 Other 3
# 2 B 6
# 4 Other 9
myTable %>%
group_by(Col0= Col1, Col1= replace(Col1,duplicated(Col1,fromLast = TRUE),"Other")) %>%
summarize_at("Col2",sum) %>%
ungroup %>%
select(-1)
# # A tibble: 4 x 2
# Col1 Col2
# <chr> <int>
# 1 A 3
# 2 Other 3
# 3 B 6
# 4 Other 9
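Both versions hinge on duplicated(..., fromLast = TRUE), which flags every row except the last one in each category. A small illustration on the Col1 values from the question:

```r
Col1 <- c("A", "A", "A", "B", "B", "B")

# With fromLast = TRUE the scan runs from the end, so only the last
# occurrence of each value is left unflagged.
is_not_last <- duplicated(Col1, fromLast = TRUE)
is_not_last
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

# Every flagged row gets the label "Other"
labels <- replace(Col1, is_not_last, "Other")
labels
# [1] "Other" "Other" "A"     "Other" "Other" "B"
```

This only picks the highest value if Col2 is sorted in increasing order within each category, as it is in the example data.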

How to delete duplicate rows (the shorter ones) based on certain columns?

Suppose I have the following df
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
col1 col2 col3
1 1 2 <NA>
2 3 4 <NA>
3 1 2 c
My goal is to delete all duplicate rows based on col1 and col2 such that the longer row "survives". In this case, the first row should be deleted. I tried
df[duplicated(df[, 1:2]), ]
but this gives me only the third row (and not the third and the second one). How to do it properly?
EDIT: The real df has 15 columns, of which the first 13 are used for identifying duplicates. In the last two columns roughly 2/3 of the rows are filled with NAs (the first 13 columns do not contain any NAs). Thus, my example df was misleading in the sense that there are two columns to be excluded for identifying the duplicates. I am sorry for that.

You can try this:
library(dplyr)
df %>% group_by(col1,col2) %>%
slice(which.min(is.na(col3)))
or this :
df %>%
group_by(col1,col2) %>%
arrange(col3) %>%
slice(1)
# # A tibble: 2 x 3
# # Groups: col1, col2 [2]
# col1 col2 col3
# <dbl> <dbl> <fctr>
# 1 1 2 c
# 2 3 4 NA
A GENERAL SOLUTION
With the most general solution there can be only one row per value of col1; see the comment in the code to add col2 to the grouping variables. It assumes all NAs are on the right.
df %>% mutate(nna = df %>% is.na %>% rowSums) %>%
group_by(col1) %>% # or group_by(col1,col2)
slice(which.min(nna)) %>%
select(-nna)
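The same "fewest NAs wins" idea can be sketched in base R. The name key_cols is an assumption standing in for the columns that identify duplicates (the first 13 in the real data described in the EDIT):

```r
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))

# Assumption: these columns identify duplicates
key_cols <- c("col1", "col2")

# Count missing values per row, then put the most complete rows first
n_missing <- rowSums(is.na(df))
ordered <- df[order(n_missing), ]

# duplicated() keeps the first occurrence, i.e. the most complete row
kept <- ordered[!duplicated(ordered[, key_cols]), ]
kept
#   col1 col2 col3
# 3    1    2    c
# 2    3    4 <NA>
```

Counting NAs over the whole row means it does not matter which of the non-key columns is filled in, only how many are.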

df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
3 1 2 c
2 3 4 <NA>
EDIT: Keep all non-NA rows
df <- data.frame(col1 = c(1, 3, 1,3, 1), col2 = c(2, 4, 2,4, 2), col3 = c("a", NA, "c",NA, "b"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2]) & is.na(df[,3])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
1 1 2 a
5 1 2 b
3 1 2 c
2 3 4 <NA>

You can sort NAs to the top or bottom before dropping dupes:
# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]
# col1 col2 col3
# 3 1 2 c
# 2 3 4 <NA>
# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by=c("col1", "col2"), fromLast=TRUE)
# col1 col2 col3
# 1: 1 2 c
# 2: 3 4 NA
This approach cannot be taken with dplyr, which doesn't offer "sort by all columns" in arrange, nor fromLast in distinct.

resample each two columns together in a data frame in R

I have a very large data frame that contains 100 rows and 400000 columns.
To sample each column, I can simply do:
df <- apply(df, 2, sample)
But I want every two columns to be sampled together. For example, if originally col1 is c(1,2,3,4,5) and col2 is c(6,7,8,9,10), and after resampling col1 becomes c(1,3,2,4,5), I want col2 to be c(6,8,7,9,10), following the resampling pattern of col1. The same goes for col3 & col4, col5 & col6, etc.
I wrote a for loop to do this, which takes forever. Is there a better way? Thanks!

You might try this; split the data frame every two columns with split.default, for each sub data frame, sample the rows and then bind them together:
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15)
index <- seq_len(nrow(df))
cbind.data.frame(
setNames(lapply(
split.default(df, (seq_along(df) - 1) %/% 2),
function(sdf) sdf[sample(index), , drop = FALSE]),
NULL)
)
# col1 col2 col3
#5 5 10 12
#4 4 9 11
#1 1 6 15
#2 2 7 14
#3 3 8 13
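With 400000 columns, splitting into 200000 small data frames may itself be slow. A possible alternative, sketched here under the assumptions that the data can be held as a matrix and that the number of columns is even, is to draw one permutation per pair and apply them all at once with matrix indexing:

```r
set.seed(1)
m <- matrix(1:20, nrow = 5)      # toy data: 4 columns, i.e. 2 pairs
n_pairs <- ncol(m) / 2           # assumes an even number of columns

# One row permutation per pair, stored as the columns of a matrix
perms <- replicate(n_pairs, sample(nrow(m)))

# Repeat each permutation for both columns of its pair
row_idx <- perms[, rep(seq_len(n_pairs), each = 2)]
col_idx <- col(m)

# Pick m[row_idx[i, j], j] for every cell in one vectorised lookup
shuffled <- matrix(m[cbind(as.vector(row_idx), as.vector(col_idx))],
                   nrow = nrow(m))
```

In the toy matrix the second column of each pair is always the first plus 5, so shuffled[, 2] - shuffled[, 1] staying constant is a quick check that the pairs moved together.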

In R, sort a list of dataframes by name, then calculate sum of two columns in each data frame

I searched the forum for a bit, but I couldn't find a question that's similar to the one I have. Basically, I have a list of dataframes that have the same column names. I want to first sort the dataframes in the list by number, then calculate the sum of Col1 and Col2 in each dataframe and store it in a vector that reflects the sorted list of dataframes.
I thought list[order(names(list))] would work, but it didn't.
For example:
df1 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(2,3,4,5,6), Col3=rep("a",5))
df3 <- data.frame(Col1=c(5,4,3,2,1),Col2=c(6,5,4,3,2), Col3=rep("a",5))
df2 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(1,2,3,4,5), Col3=rep("a",5))
list <- list(df1, df3, df2)
>list
$df1
Col1 Col2 Col3
1 2 a
2 3 a
3 4 a
4 5 a
5 6 a
$df3
Col1 Col2 Col3
5 6 a
4 5 a
3 4 a
2 3 a
1 2 a
$df2
Col1 Col2 Col3
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
First, I want to sort it, like this
$df1
Col1 Col2 Col3
1 2 a
2 3 a
3 4 a
4 5 a
5 6 a
$df2
Col1 Col2 Col3
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
$df3
Col1 Col2 Col3
5 6 a
4 5 a
3 4 a
2 3 a
1 2 a
Then, I want to get the sum of Col1 and Col2 in each dataframe, and store it in a new vector (let's call it x). The result should look like this
x
35, 30, 35
With what I presented, I would imagine that there is both a for-loop solution and a lapply solution.

Here is a one line method using an anonymous function:
a = 1
df1 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(2,3,4,5,6), Col3=rep(a,5))
df3 <- data.frame(Col1=c(5,4,3,2,1),Col2=c(6,5,4,3,2), Col3=rep(a,5))
df2 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(1,2,3,4,5), Col3=rep(a,5))
list <- list(df1 = df1, df3 =df3, df2 =df2)
r = unlist(lapply(list[order(names(list))], function(df) {sum(df[,1]) + sum(df[,2])}))
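The key point is that the list must be named, since names() of an unnamed list is NULL and there is nothing to order by. A slightly more defensive sketch of the same approach uses vapply, which guarantees a numeric vector; the name dfs is just an illustrative replacement for list, chosen to avoid masking the base function:

```r
df1 <- data.frame(Col1 = c(1, 2, 3, 4, 5), Col2 = c(2, 3, 4, 5, 6))
df3 <- data.frame(Col1 = c(5, 4, 3, 2, 1), Col2 = c(6, 5, 4, 3, 2))
df2 <- data.frame(Col1 = c(1, 2, 3, 4, 5), Col2 = c(1, 2, 3, 4, 5))

# Name the elements explicitly so they can be sorted by name
dfs <- list(df1 = df1, df3 = df3, df2 = df2)
sorted <- dfs[order(names(dfs))]

# One number per data frame: sum of Col1 plus sum of Col2
x <- vapply(sorted, function(d) sum(d$Col1) + sum(d$Col2), numeric(1))
x
# df1 df2 df3 
#  35  30  35 
```

Note that order(names(dfs)) sorts the names as text, so with ten or more data frames a name like df10 would sort before df2.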

Here is an approach using the sqldf package. Is this what you need?
library(sqldf)
df1 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(2,3,4,5,6))
df3 <- data.frame(Col1=c(5,4,3,2,1),Col2=c(6,5,4,3,2))
df2 <- data.frame(Col1=c(1,2,3,4,5),Col2=c(1,2,3,4,5))
list <- list(df1, df3, df2)
list
df1 <- sqldf("SELECT * FROM df1 ORDER BY Col1, Col2")
df2 <- sqldf("SELECT * FROM df2 ORDER BY Col1, Col2")
df3 <- sqldf("SELECT * FROM df3 ORDER BY Col1 DESC, Col2 DESC")
df1
df2
df3
df1 <- sqldf("SELECT SUM(Col1 +Col2) FROM df1")
df2 <- sqldf("SELECT SUM(Col1+Col2) FROM df2")
df3 <- sqldf("SELECT SUM(Col1+Col2) FROM df3")
df1
df2
df3
x <- vector()
x <- c(df1, df2, df3)
x
which gives the following result:
> x
$`SUM(Col1 +Col2)`
[1] 35
$`SUM(Col1+Col2)`
[1] 30
$`SUM(Col1+Col2)`
[1] 35

Only Keep Certain Combinations of Predictors in a Dataframe

Imagine that I have a data frame like this:
> col1 <- rep(1:3,10)
> col2 <- rep(c("a","b"),15)
> col3 <- rnorm(30,10,2)
> sample_df <- data.frame(col1 = col1, col2 = col2, col3 = col3)
> head(sample_df)
col1 col2 col3
1 1 a 13.460322
2 2 b 3.404398
3 3 a 8.952066
4 1 b 11.148271
5 2 a 9.808366
6 3 b 9.832299
I only want to keep combinations of predictors which, together, have a col3 standard deviation below 2. I can find the combinations using ddply, but I don't know how to backtrack to the original DF and select the correct levels.
> library(plyr)
> sample_df_summ <- ddply(sample_df, .(col1, col2), summarize, sd = sd(col3), count = length(col3))
> head(sample_df_summ)
col1 col2 sd count
1 1 a 2.702328 5
2 1 b 1.032371 5
3 2 a 2.134151 5
4 2 b 3.348726 5
5 3 a 2.444884 5
6 3 b 1.409477 5
For clarity, in this example, I'd like the DF with col1 = 3, col2 = b and col1 = 1, col2 = b. How would I do this?

You can add a "keep" column that is TRUE only if the standard deviation is below 2. Then, you can use a left join (merge) to add the "keep" column to the initial dataframe. In the end, you just select with keep equal to TRUE.
# add the keep column
sample_df_summ$keep <- sample_df_summ$sd < 2
sample_df_summ$sd <- NULL
sample_df_summ$count <- NULL
# join and select the rows
sample_df_keep <- merge(sample_df, sample_df_summ, by = c("col1", "col2"), all.x = TRUE, all.y = FALSE)
sample_df_keep <- sample_df_keep[sample_df_keep$keep, ]
sample_df_keep$keep <- NULL
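If the summary table is not needed for anything else, the join can be skipped. As a rough sketch, base R's ave() returns the group statistic aligned with the original rows, so it can be used directly as a filter. The toy data below is made up so that the result is deterministic:

```r
# Made-up data: groups (1, a) and (2, b) have a small spread, the others do not
sample_df <- data.frame(
  col1 = rep(1:2, each = 4),
  col2 = rep(c("a", "b"), times = 4),
  col3 = c(10, 1, 10.5, 9, 11, 2, 20, 2.5)
)

# Standard deviation of col3 within each (col1, col2) group, one value per row
group_sd <- ave(sample_df$col3, sample_df$col1, sample_df$col2, FUN = sd)

# Keep only rows whose group has a standard deviation below 2
kept <- sample_df[group_sd < 2, ]
kept
```

FUN has to be passed by name here, because ave() treats every unnamed argument after the first as a grouping variable.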

Using dplyr:
library(dplyr)
sample_df %>% group_by(col1, col2) %>% mutate(sd = sd(col3)) %>% filter(sd < 2)
You get:
#Source: local data frame [6 x 4]
#Groups: col1, col2
#
# col1 col2 col3 sd
#1 1 a 10.516437 1.4984853
#2 1 b 11.124843 0.8652206
#3 2 a 7.585740 1.8781241
#4 3 b 9.806124 1.6644076
#5 1 a 7.381209 1.4984853
#6 1 b 9.033093 0.8652206