I'd like to know how to consolidate duplicate rows in a data frame and then combine the duplicated values in another column.
Here's a sample of the existing data frame and two data frames that would be acceptable as a solution:
df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"), col2 = c(1, 2, 3, 4))
df.ideal <- data.frame(col1 = c("test1", "test2", "test3"), col2 = c(1, "2, 3", 4))
df.ideal2 <- data.frame(col1 = c("test1", "test2", "test3"),
col2 = c(1, 2, 4),
col3 = c(NA, 3, NA))
In the first ideal data frame, the duplicated row is collapsed and the column holds both numbers. I've looked at other similar questions on Stack Overflow, but they all dealt with combining rows. I need to delete the duplicate row because I have another dataset I'm merging it with that needs a certain number of rows, and I want to preserve all of the values. Thanks for your help!
To go from df1 to df.ideal, you can use aggregate().
aggregate(col2~col1, df1, paste, collapse=",")
# col1 col2
# 1 test1 1
# 2 test2 2,3
# 3 test3 4
If you want to get to df.ideal2, that's more of a reshaping from long to wide process. You can do
reshape(transform(df1, time=ave(col2, col1, FUN=seq_along)), idvar="col1", direction="wide")
# col1 col2.1 col2.2
# 1 test1 1 NA
# 2 test2 2 3
# 4 test3 4 NA
using just the base reshape() function.
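The reshape() call above depends on a within-group counter that ave() builds. As a minimal sketch (the same data and calls, only split into two steps so the intermediate time column can be inspected), it could look like this:

```r
# Sample data from the question
df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"),
                  col2 = c(1, 2, 3, 4),
                  stringsAsFactors = FALSE)

# Number the rows within each col1 group: 1, 1, 2, 1
df1$time <- ave(df1$col2, df1$col1, FUN = seq_along)

# One row per col1 value, one col2.<time> column per within-group position
wide <- reshape(df1, idvar = "col1", timevar = "time", direction = "wide")
print(wide)
```

Groups with fewer rows than the largest group are padded with NA in the extra columns.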
Another option would be to use splitstackshape
library(data.table)
library(splitstackshape)
DT1 <- setDT(df1)[,list(col2=toString(col2)) ,col1]
DT1
# col1 col2
#1: test1 1
#2: test2 2, 3
#3: test3 4
You could split the col2 in DT1 to get the df.ideal2 or
cSplit(DT1, 'col2', sep=',')
# col1 col2_1 col2_2
#1: test1 1 NA
#2: test2 2 3
#3: test3 4 NA
or from df1
dcast.data.table(getanID(df1, 'col1'), col1~.id, value.var='col2')
# col1 1 2
#1: test1 1 NA
#2: test2 2 3
#3: test3 4 NA
I have a data table in the format:
myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)
print(myTable)
Col1 Col2
1: A 1
2: A 2
3: A 3
4: B 4
5: B 5
6: B 6
I want to show only the highest result for each category in Col1, then collapse all others and present their sum in Col2. It should look like this:
print(myTable)
Col1 Col2
1: A 3
2: Others 3
3: B 6
4: Others 9
I managed to do it with the following code:
unique <- unique(myTable$Col1) # unique values in Col1
myTable2 <- data.table() # empty data table to populate
for(each in unique){
temp <- myTable[Col1 == each, ] # filter myTable for unique Col1 values
temp <- temp[order(-Col2)] # order filtered table by decreasing Col2
sumCol2 <- sum(temp$Col2) # sum of values in filtered Col2
temp <- temp[1, ] # retain only first element
remSum <- sumCol2 - sum(temp$Col2) # remaining sum in Col2 (without first element)
temp <- rbindlist(list(temp, data.table("Others", remSum))) # rbind first element and remaining elements
myTable2 <- rbindlist(list(myTable2, temp)) # populate data table from beginning
}
This works, but I am trying to shorten a very large data table, so it takes forever.
Is there any better way to approach this?
Thanks.
UPDATE: Actually, my procedure is a little more complicated. I figured I would be able to develop it myself once the basics were mastered, but it seems I will need further help. I want to display the 5 highest values for each category in Col1 and collapse the others, but some categories in Col1 do not have 5 values; in these cases, all entries should be displayed and no "Others" row should be added.
Here the data is split into groups according to the value of Col1 (by = Col1). .N is the number of rows in the given group, and so also the index of its last row, so c(Col2[.N], sum(Col2) - Col2[.N]) gives the last value of Col2 and the sum of Col2 minus that last value. Because the last value is used rather than the maximum, this relies on each group being sorted by Col2, as in the example. The newly created columns are wrapped in .() because .() is an alias for list() in data.table, and the created columns need to go in a list.
library(data.table)
setDT(df)
df[, .(Col1 = c(Col1, 'Others'),
Col2 = c(Col2[.N], sum(Col2) - Col2[.N]))
, by = Col1][, -1]
# Col1 Col2
# 1: A 3
# 2: Others 3
# 3: B 6
# 4: Others 9
If it is just a matter of displaying things, you could use the 'tables' package:
others <- function(x) sum(x)-last(x)
df %>% tabular(Col1*(last+others) ~ Col2, .)
# Col1 Col2
# A last 3
# others 3
# B last 6
# others 9
do.call(
rbind, lapply(split(myTable, factor(myTable$Col1)), function(x) rbind(x[which.max(x$Col2),], list("Other", sum(x$Col2[-which.max(x$Col2)]))))
)
# Col1 Col2
#1: A 3
#2: Other 3
#3: B 6
#4: Other 9
I did it! I made a new myTable to illustrate. I want to retain only the 4 highest values by category, and collapse the others.
set.seed(123)
myTable <- data.table(Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)), Col2 = sample(1:12, 12))
print(myTable)
Col1 Col2
1: A 8
2: A 5
3: A 2
4: B 7
5: B 10
6: B 9
7: B 12
8: B 11
9: C 4
10: C 6
11: C 3
12: C 1
# set key to Col2, it will sort it increasingly
setkey(myTable, Col2)
# if there are more than 4 entries by Col1 category, will return all information, otherwise will return 4 entries completing with NA
myTable <- myTable[,.(Col2 = Col2[1:max(c(4, .N))]) , by = Col1]
# will print in Col1: 4 entries of Col1 category, then "Other"
# will print in Col2: 4 last entries of Col2 in that category, then the remaining sum
myTable <- myTable[, .(Col1 = c(rep(Col1, 4), "Other"), Col2 = c(Col2[.N-3:0], sum(Col2) - sum(Col2[.N-3:0]))), by = Col1]
# removes rows with NA inserted in first step
myTable <- na.omit(myTable)
# removes rows where Col2 = 0, inserted because that Col1 category had exactly 4 entries
myTable <- myTable[Col2 != 0]
Owooooo!
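The same "keep the top n, sum the rest" idea can also be sketched in base R without the NA padding and the Col2 != 0 clean-up. This is only an illustration: top_n_with_other is a hypothetical helper name, the data is typed in from the printed table above rather than regenerated with sample(), and values are listed in decreasing order within each category.

```r
# Data as printed in the example above
myTable <- data.frame(
  Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)),
  Col2 = c(8, 5, 2, 7, 10, 9, 12, 11, 4, 6, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n largest values per group and
# collapse whatever is left into a single "Other" row
top_n_with_other <- function(group, value, n = 4) {
  groups <- split(value, group)
  pieces <- lapply(names(groups), function(g) {
    v <- sort(groups[[g]], decreasing = TRUE)
    keep <- seq_along(v) <= n
    out <- data.frame(Col1 = rep(g, sum(keep)), Col2 = v[keep],
                      stringsAsFactors = FALSE)
    # Add the "Other" row only when something is actually left over
    if (any(!keep)) {
      out <- rbind(out, data.frame(Col1 = "Other", Col2 = sum(v[!keep]),
                                   stringsAsFactors = FALSE))
    }
    out
  })
  do.call(rbind, pieces)
}

res <- top_n_with_other(myTable$Col1, myTable$Col2, n = 4)
print(res)
```

Categories with n or fewer entries are returned in full, so no "Other" row appears for A (3 entries) or C (exactly 4 entries).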
Here's a base R solution and the dplyr equivalent:
res <- aggregate(Col2 ~.,transform(
myTable, Col0 = replace(Col1,duplicated(Col1,fromLast = TRUE), "Other")), sum)
res[order(res$Col1),-1]
# Col0 Col2
# 1 A 3
# 3 Other 3
# 2 B 6
# 4 Other 9
myTable %>%
group_by(Col0= Col1, Col1= replace(Col1,duplicated(Col1,fromLast = TRUE),"Other")) %>%
summarize_at("Col2",sum) %>%
ungroup %>%
select(-1)
# # A tibble: 4 x 2
# Col1 Col2
# <chr> <int>
# 1 A 3
# 2 Other 3
# 3 B 6
# 4 Other 9
I want to do a union of two dataframes, that share some rows with same rowName. For those rows with common rowNames, I would like to take into account the second dataframe values, and not the first one's. For example :
df1 <- data.frame(col1 = c(1,2), col2 = c(2,4), row.names = c("row_1", "row_2"))
df1
# col1 col2
# row_1 1 2
# row_2 2 4
df2 <- data.frame(col1 = c(3,10), col2 = c(6,99), row.names = c("row_3", "row_2"))
df2
# col1 col2
# row_3 3 6
# row_2 10 99
The result I would like to obtain would then be :
someSpecificRBind(df1,df2, takeIntoAccount=df2)
# col1 col2
# row_1 1 2
# row_2 10 99
# row_3 3 6
The function rbind doesn't do the job: instead of replacing the rows with common row names, it keeps both copies and renames the duplicated row names.
I would conceptualize this as only adding to df2 the rows in df1 that aren't already there:
rbind(df2, df1[setdiff(rownames(df1), rownames(df2)), ])
We get the index of duplicated elements and use that to filter
rbind(df2, df1)[!duplicated(c(row.names(df2), row.names(df1))),]
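As a quick check, the two approaches above can be run side by side. This sketch assumes df2 holds the values shown in the question's printed output (row_3 = 3, 6 and row_2 = 10, 99); the final reordering by row name is an addition to match the desired output.

```r
df1 <- data.frame(col1 = c(1, 2), col2 = c(2, 4),
                  row.names = c("row_1", "row_2"))
# Assumption: values follow the printed df2, one row per row name
df2 <- data.frame(col1 = c(3, 10), col2 = c(6, 99),
                  row.names = c("row_3", "row_2"))

# Approach 1: add to df2 only the df1 rows that df2 does not already have
only_in_df1 <- setdiff(rownames(df1), rownames(df2))
by_setdiff <- rbind(df2, df1[only_in_df1, , drop = FALSE])

# Approach 2: stack everything, then drop the later copy of each row name
all_names <- c(row.names(df2), row.names(df1))
by_dup <- rbind(df2, df1)[!duplicated(all_names), ]

# Optional: sort by row name to match the desired output
result <- by_setdiff[order(rownames(by_setdiff)), ]
print(result)
```

Both approaches put df2 first, which is why its values win for row_2.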
I have a very messy data frame consisting of factor columns with numbers and characters. I need to filter the rows with numeric values above a threshold. However, this is a problem because my columns are factors that cannot be turned to numeric, due to the presence of characters in them.
DF <- data.frame(
Col1 = c("Egg", "", "3"),
Col2 = c("", "Flour", ""),
Col3 = c("2", "", "Bread"),
Col4 = c("4", "", ""),
Col5 = c("", "6", "8")
)
The resulting data frame looks like this:
> DF
  Col1  Col2  Col3 Col4 Col5
1  Egg           2    4
2      Flour               6
3    3       Bread         8
Where each column is a factor:
> class(DF$Col1)
[1] "factor"
>
In this example, how do I filter rows with numeric values above, say, 5 in at least one column? The desired output in this example, looks like this:
> DF
  Col1  Col2  Col3 Col4 Col5
2      Flour               6
3    3       Bread         8
You'll get some warnings from dplyr but this works as well:
library(dplyr)
DF %>%
mutate_all(as.character) %>%
filter_all(any_vars(if_else(is.na(as.numeric(.)), FALSE, as.numeric(.) > 5)))
  Col1  Col2  Col3 Col4 Col5
1      Flour               6
2    3       Bread         8
Per @Frank's suggestion (a bit cleaner than above):
DF %>%
filter_all(any_vars(as.numeric(as.character(.)) > 5))
  Col1  Col2  Col3 Col4 Col5
1      Flour               6
2    3       Bread         8
One can strip the non-digit characters from each value with gsub and convert the result to numeric. In base R, subset with apply then provides a solution:
subset(DF, apply(DF, 1, function(x){
#Get only numeric values and convert to numeric
val <- as.numeric(gsub("[^[:digit:]]", "",x))
any(val[!is.na(val)] > 5)
})
)
#   Col1  Col2  Col3 Col4 Col5
# 2      Flour               6
# 3    3       Bread         8
One way this can be done is:
DF[do.call(function(...) pmax(..., na.rm=TRUE), data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE)) > 5,]
To explain what this is doing, the lapply(DF, as.character) is removing the factors, then lapply(lapply(DF, as.character), as.numeric) is converting the characters to numbers (the text becomes NA), and then data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE) changes it back to a dataframe, e.g.
> data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE)
Col1 Col2 Col3 Col4 Col5
1 NA NA 2 4 NA
2 NA NA NA NA 6
3 3 NA NA NA 8
The do.call with pmax finds the row maximum (thanks to the question "rowwise maximum for R"), and then we can easily filter for a maximum value above 5.
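The one-liner explained above can be unrolled into named steps. The following is a sketch of the same technique under two assumptions that are not in the original: the columns are forced to factors with stringsAsFactors = TRUE so the example behaves the same on any R version, and which() is used so that a row containing no numbers at all (row maximum NA) is dropped rather than returned as an NA row.

```r
DF <- data.frame(
  Col1 = c("Egg", "", "3"),
  Col2 = c("", "Flour", ""),
  Col3 = c("2", "", "Bread"),
  Col4 = c("4", "", ""),
  Col5 = c("", "6", "8"),
  stringsAsFactors = TRUE
)

# Step 1: factor -> character -> numeric; text and empty strings become NA
num <- lapply(DF, function(col) suppressWarnings(as.numeric(as.character(col))))

# Step 2: row-wise maximum across all columns, ignoring NA
row_max <- do.call(pmax, c(num, na.rm = TRUE))

# Step 3: keep the rows whose maximum exceeds the threshold
filtered <- DF[which(row_max > 5), ]
print(filtered)
```

Going through as.character() first matters because as.numeric() on a factor returns the internal level codes, not the displayed values.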
Suppose I have the following df
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
col1 col2 col3
1 1 2 <NA>
2 3 4 <NA>
3 1 2 c
My goal is to delete all duplicate rows based on col1 and col2 such that the longer row "survives". In this case, the first row should be deleted. I tried
df[duplicated(df[, 1:2]), ]
but this gives me only the third row (and not the third and the second one). How to do it properly?
EDIT: The real df has 15 columns, of which the first 13 are used for identifying duplicates. In the last two columns roughly 2/3 of the rows are filled with NAs (the first 13 columns do not contain any NAs). Thus, my example df was misleading in the sense that there are two columns to be excluded for identifying the duplicates. I am sorry for that.
You can try this:
library(dplyr)
df %>% group_by(col1,col2) %>%
slice(which.min(is.na(col3)))
or this :
df %>%
group_by(col1,col2) %>%
arrange(col3) %>%
slice(1)
# # A tibble: 2 x 3
# # Groups: col1, col2 [2]
# col1 col2 col3
# <dbl> <dbl> <fctr>
# 1 1 2 c
# 2 3 4 NA
A GENERAL SOLUTION
With this most general solution there can be only one row per value of col1; see the comment in the code for adding col2 to the grouping variables. It keeps the row with the fewest NAs in each group and assumes that the NAs are all in the rightmost columns.
df %>% mutate(nna = df %>% is.na %>% rowSums) %>%
group_by(col1) %>% # or group_by(col1,col2)
slice(which.min(nna)) %>%
select(-nna)
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
3 1 2 c
2 3 4 <NA>
EDIT: Keep all non-NA rows
df <- data.frame(col1 = c(1, 3, 1,3, 1), col2 = c(2, 4, 2,4, 2), col3 = c("a", NA, "c",NA, "b"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2]) & is.na(df[,3])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
1 1 2 a
5 1 2 b
3 1 2 c
2 3 4 <NA>
You can sort NAs to the top or bottom before dropping dupes:
# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]
# col1 col2 col3
# 3 1 2 c
# 2 3 4 <NA>
# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by=c("col1", "col2"), fromLast=TRUE)
# col1 col2 col3
# 1: 1 2 c
# 2: 3 4 NA
This approach cannot be taken with dplyr, which doesn't offer "sort by all columns" in arrange, nor fromLast in distinct.
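For the case described in the question's EDIT, where only some columns identify duplicates and the remaining ones may be NA, one base R sketch is to count the missing values in the non-key columns, sort by that count, and keep the first row per key. The names key_cols and n_missing are illustrative; in the real data key_cols would be the first 13 column names.

```r
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"),
                 stringsAsFactors = FALSE)

# Columns that define a duplicate (assumption: the first two here)
key_cols <- c("col1", "col2")
other_cols <- setdiff(names(df), key_cols)

# Number of missing values per row in the non-key columns
n_missing <- rowSums(is.na(df[, other_cols, drop = FALSE]))

# Most complete rows first, then keep the first row of each key combination
ordered <- df[order(n_missing), ]
result <- ordered[!duplicated(ordered[, key_cols]), ]
print(result)
```

order() is stable for ties, so among equally complete duplicates the row that appears first in the original data is the one kept.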
I have a very large data frame that contains 100 rows and 400000 columns.
To sample each column, I can simply do:
df <- apply(df, 2, sample)
But I want every two columns to be sampled together. For example, if originally col1 is c(1,2,3,4,5) and col2 is c(6,7,8,9,10), and after resampling col1 becomes c(1,3,2,4,5), I want col2 to be c(6,8,7,9,10), following the resampling pattern of col1. Same thing for col3 & col4, col5 & col6, etc.
I wrote a for loop to do this, which takes forever. Is there a better way? Thanks!
You might try this; split the data frame every two columns with split.default, for each sub data frame, sample the rows and then bind them together:
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15)
index <- seq_len(nrow(df))
cbind.data.frame(
setNames(lapply(
split.default(df, (seq_along(df) - 1) %/% 2),
function(sdf) sdf[sample(index),,drop=F]),
NULL)
)
# col1 col2 col3
#5 5 10 12
#4 4 9 11
#1 1 6 15
#2 2 7 14
#3 3 8 13
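An alternative sketch, assuming all columns share one type so the data can be held as a matrix: loop over the column pairs and apply one row permutation per pair. Indexing a matrix avoids the per-column overhead of a data frame, which may help at this width, though that is not measured here. The variable names are illustrative.

```r
set.seed(1)
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15, col4 = 16:20)

orig <- as.matrix(df)      # assumption: every column has the same type
shuffled <- orig

# First column of each pair: 1, 3, 5, ...
starts <- seq(1, ncol(shuffled), by = 2)
for (s in starts) {
  # A trailing unpaired column is shuffled on its own
  cols <- s:min(s + 1, ncol(shuffled))
  # One permutation of the rows, applied to both columns of the pair
  shuffled[, cols] <- shuffled[sample(nrow(shuffled)), cols]
}
print(shuffled)
```

Within each pair the rows stay aligned, so every value in col2 is still its col1 partner plus 5 in this example.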