Here's some sample data:
dat="x1 x2 x3 x4 x5
1 C 1 16 NA 16
2 A 1 16 16 NA
3 A 1 16 16 NA
4 A 4 64 64 NA
5 C 4 64 NA 64
6 A 1 16 16 NA
7 A 1 16 16 NA
8 A 1 16 16 NA
9 B 4 64 32 32
10 A 3 48 48 NA
11 B 4 64 32 32
12 B 3 48 32 16"
data<-read.table(text=dat,header=TRUE)
aggregate(cbind(x2,x3,x4,x5)~x1, FUN=sum, data=data)
x1 x2 x3 x4 x5
1  B 11 176  96  80
How do I also get the sums for groups A and C of x1?
aggregate(.~x1, FUN=sum, data=data, na.action = na.omit)
x1 x2 x3 x4 x5
1 B 11 176 96 80
When I use sqldf:
library("sqldf")
sqldf("select sum(x2),sum(x3),sum(x4),sum(x5) from data group by x1")
sum(x2) sum(x3) sum(x4) sum(x5)
1 12 192 192 <NA>
2 11 176 96 80
3 5 80 NA 80
Why do I get <NA> in the first row, but NA in the third row?
What is the difference between them? Why do I get <NA> at all? There is no <NA> in data!
str(data)
'data.frame': 12 obs. of 5 variables:
$ x1: Factor w/ 3 levels "A","B","C": 3 1 1 1 3 1 1 1 2 1 ...
$ x2: int 1 1 1 4 4 1 1 1 4 3 ...
$ x3: int 16 16 16 64 64 16 16 16 64 48 ...
$ x4: int NA 16 16 64 NA 16 16 16 32 48 ...
$ x5: int 16 NA NA NA 64 NA NA NA 32 NA ...
The sqldf problem remains: why does sum(x4) give NA while sum(x5) gives <NA>?
I can check that the NAs in x4 and x5 are the same kind of NA by replacing them all with 0:
data[is.na(data)] <- 0
> data
x1 x2 x3 x4 x5
1 C 1 16 0 16
2 A 1 16 16 0
3 A 1 16 16 0
4 A 4 64 64 0
5 C 4 64 0 64
6 A 1 16 16 0
7 A 1 16 16 0
8 A 1 16 16 0
9 B 4 64 32 32
10 A 3 48 48 0
11 B 4 64 32 32
12 B 3 48 32 16
The fact that sqldf treats sum(x4) and sum(x5) differently is so strange that I suspect a logic problem in sqldf. It can be reproduced on other PCs; please try it yourself before the discussion continues.
Here's the data.table way in case you're interested:
require(data.table)
dt <- data.table(data)
dt[, lapply(.SD, sum, na.rm=TRUE), by=x1]
# x1 x2 x3 x4 x5
# 1: C 5 80 0 80
# 2: A 12 192 192 0
# 3: B 11 176 96 80
If you want sum to return NA (rather than the sum of the non-NA values), just remove the na.rm=TRUE argument.
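For instance (a quick sketch with the same dt as above), any group containing an NA in a column then sums to NA for that column:
dt[, lapply(.SD, sum), by=x1]
#    x1 x2  x3  x4 x5
# 1:  C  5  80  NA 80
# 2:  A 12 192 192 NA
# 3:  B 11 176  96 80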
.SD is an internal data.table variable that holds, by default, all the columns not in by (here, everything except x1). You can check the contents of .SD with:
dt[, print(.SD), by=x1]
to get an idea of what .SD is. If you're interested, check ?data.table for other internal (and very useful) special variables like .I, .N, .GRP, etc.
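For instance, .N holds the number of rows in each group:
dt[, .N, by=x1]
#    x1 N
# 1:  C 2
# 2:  A 7
# 3:  B 3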
Because of how the formula method for aggregate handles NA values by default, you need to override that before using the na.rm argument from sum. You can do this by setting na.action to NULL or na.pass:
aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data,
na.rm = TRUE, na.action = NULL)
# x1 x2 x3 x4 x5
# 1 A 12 192 192 0
# 2 B 11 176 96 80
# 3 C 5 80 0 80
aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data,
na.rm = TRUE, na.action = na.pass)
# x1 x2 x3 x4 x5
# 1 A 12 192 192 0
# 2 B 11 176 96 80
# 3 C 5 80 0 80
Regarding sqldf, it seems the columns are cast to different types depending on whether the value in the first row of the first group is NA or not. If it is NA, that column gets returned as character.
Compare:
df1 <- data.frame(id = c(1, 1, 2, 2, 2),
A = c(1, 1, NA, NA, NA),
B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df1 group by id")
# sum(A) sum(B)
# 1 2 <NA>
# 2 NA 3.0
df2 <- data.frame(id = c(2, 2, 1, 1, 1),
A = c(1, 1, NA, NA, NA),
B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df2 group by id")
# sum(A) sum(B)
# 1 <NA> 3
# 2 2.0 NA
However, there is an easy workaround: reassign the original name to the new columns being created. Perhaps that lets SQLite inherit some of the information from the previous database? (I don't really use SQL.)
Example (with the same "df2" created earlier):
sqldf("select sum(A) `A`, sum(B) `B` from df2 group by id")
# A B
# 1 NA 3
# 2 2 NA
You can easily use paste to create your select statement:
Aggs <- paste("sum(", names(data)[-1], ") `",
names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
# x2 x3 x4 x5
# 1 12 192 192 NA
# 2 11 176 96 80
# 3 5 80 NA 80
str(.Last.value)
# 'data.frame': 3 obs. of 4 variables:
# $ x2: int 12 11 5
# $ x3: int 192 176 80
# $ x4: int 192 96 NA
# $ x5: int NA 80 80
A similar approach can be taken if you want NA to be replaced with 0:
Aggs <- paste("sum(ifnull(", names(data)[-1], ", 0)) `",
names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
# x2 x3 x4 x5
# 1 12 192 192 0
# 2 11 176 96 80
# 3 5 80 0 80
aggregate(data[, -1], by=list(data$x1), FUN=sum)
I dropped the first column because it isn't used in the sum; it is just the grouping variable used to split the data (which is why it is supplied to by).
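Note that this was run after the NAs were replaced with 0 above. On the original data you could instead pass na.rm = TRUE through to sum; a small sketch:
aggregate(data[, -1], by=list(x1=data$x1), FUN=sum, na.rm=TRUE)
#   x1 x2  x3  x4 x5
# 1  A 12 192 192  0
# 2  B 11 176  96 80
# 3  C  5  80   0 80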
Here's how you would do this with the reshape package:
> # x1 = identifier variable, everything else = measured variables
> data_melted <- melt(data, id="x1", measured=c("x2", "x3", "x4", "x5"))
>
> # Thus we now have (measured variable and its value) per x1 (id variable)
> head(data_melted)
x1 variable value
1 C x2 1
2 A x2 1
3 A x2 1
4 A x2 4
5 C x2 4
6 A x2 1
> tail(data_melted)
x1 variable value
43 A x5 NA
44 A x5 NA
45 B x5 32
46 A x5 NA
47 B x5 32
48 B x5 16
> # Now aggregate using sum, passing na.rm to it
> cast(data_melted, x1 ~ ..., sum, na.rm=TRUE)
x1 x2 x3 x4 x5
1 A 12 192 192 0
2 B 11 176 96 80
3 C 5 80 0 80
Alternatively, you could have removed the NAs during the melt()-ing step itself.
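A sketch of that on the same data: with the NAs dropped at melt time, cast() no longer needs na.rm, and the empty cells should be filled with sum() of a zero-length vector, i.e. 0 (cast's default fill):
data_melted <- melt(data, id="x1", na.rm=TRUE)
cast(data_melted, x1 ~ ..., sum)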
The great thing about learning library(reshape) is, quoting the author ("Reshaping Data with the reshape Package"):
"In R, there are a number of general functions that can aggregate data,
for example tapply, by and aggregate, and a function specifically for
reshaping data, reshape. Each of these functions tends to deal well
with one or two specific scenarios, and each requires slightly different
input arguments. In practice, you need careful thought to piece
together the correct sequence of operations to get your data into the
form that you want. The reshape package grew out of my frustrations
with reshaping data for consulting clients, and overcomes these
problems with a general conceptual framework that uses just two
functions: melt and cast."
I'm trying to use the mutate function in order to create a variable based on conditions involving three others.
These conditions were created using case_when, as you can see in the code below.
But some of the conditions use NA values, and these seem to be causing an error in the mutate function.
Check it out, please:
# About the variables being used:
unique(x1)
# [1] 1 0 NA
str(pemg$x1)
# num [1:1622989] 1 0 0 1 1 0 1 1 0 0 ...
unique(x2)
# [1] 16 66 38 11 8 6 14 17 53 59 10 31 50 19 48 42 44 21 54 55 56 18 57 61 13 43 7 4 15
# [30] 39 5 20 3 37 23 51 36 52 68 58 27 65 62 2 12 32 41 49 46 35 34 45 81 69 33 40 0 70
# [59] 9 47 63 29 25 22 64 24 60 30 67 26 71 72 28 1 75 80 87 77 73 78 76 79 74 83 92 102 85
# [88] 86 90 82 91 84 88 93 89 96 95 105 115 106 94 100 99 97 104 98 103 108 109 101 117 107 114 113 NA 112
# [117] 110 111
str(pemg$x2)
# num [1:1622989] 16 66 38 11 8 6 14 17 53 59 ...
unique(x3)
# [1] 6 3 4 5 0 8 2 1 11 9 10 7 NA 15
str(pemg$anoest)
# num [1:1622989] 6 3 4 5 3 0 5 8 4 2 ...
df <- mutate(df,
y = case_when(
x1 == 1 & x2 >= 7 & x3 == 0 ~ 1,
x1 == 1 & x2 >= 8 & x3 == 1 ~ 1,
x1 == 1 & x2 >= 10 & x3 == 3 ~ 1,
x1 == 1 & x2 >= 11 & x3 == 4 ~ 1,
x1 == 1 & x2 >= 12 & x3 == 5 ~ 1,
x1 == 1 & x2 >= 13 & x3 == 6 ~ 1,
x1 == 1 & x2 >= 14 & x3 == 7 ~ 1,
x1 == 1 & x2 >= 15 & x3 == 8 ~ 1,
x1 == 1 & x2 >= 16 & x3 == 9 ~ 1,
x1 == 1 & x2 >= 17 & x3 == 10 ~ 1,
x1 == 1 & x2 >= 18 & x3 == 11 ~ 1,
x1 == 1 & !is.na(x3) ~ 0,
x1 == 1 & x3 %in% 12:16 ~ 0,
x2 %in% 0:7 ~ NA,
x2 > 18 ~ NA,
x1 == 0 ~ NA,
is.na(x3) ~ NA))
# Error: Problem with `mutate()` input `defasado`.
# x must be a double vector, not a logical vector.
# i Input `defasado` is `case_when(...)`.
# Run `rlang::last_error()` to see where the error occurred.
last_error()
# <error/dplyr_error>
# Problem with `mutate()` input `y`.
# x must be a double vector, not a logical vector.
# i Input `y` is `case_when(...)`.
# Backtrace:
# 1. dplyr::mutate(...)
# 2. dplyr:::mutate.data.frame(...)
# 3. dplyr:::mutate_cols(.data, ...)
# Run `rlang::last_trace()` to see the full context.
last_trace()
# <error/dplyr_error>
# Problem with `mutate()` input `defasado`.
# x must be a double vector, not a logical vector.
# i Input `defasado` is `case_when(...)`.
# Backtrace:
# x
# 1. +-dplyr::mutate(...)
# 2. \-dplyr:::mutate.data.frame(...)
# 3. \-dplyr:::mutate_cols(.data, ...)
# <parent: error/rlang_error>
# must be a double vector, not a logical vector.
# Backtrace:
# x
# 1. +-mask$eval_all_mutate(dots[[i]])
# 2. \-dplyr::case_when(...)
# 3. \-dplyr:::replace_with(...)
# 4. \-dplyr:::check_type(val, x, name)
# 5. \-dplyr:::glubort(header, "must be {friendly_type_of(template)}, not {friendly_type_of(x)}.")
Can someone give me a hint on how to solve this?
The problem here is the result type of your case_when. if_else from dplyr is stricter than ifelse from base R: all result values have to be of the same type. Since case_when is a vectorization of multiple if_else calls, you have to tell R which type of NA the output should be:
library(dplyr)
# does not work
dplyr::tibble(d = c(6,2,4, NA, 5)) %>%
dplyr::mutate(v = case_when(d < 4 ~ 0,
is.na(d) ~ NA))
# works
dplyr::tibble(d = c(6,2,4, NA, 5)) %>%
dplyr::mutate(v = case_when(d < 4 ~ 0,
is.na(d) ~ NA_real_))
You need to make sure your NAs are the right class. In your case, wrap the NA after the ~ in as.numeric(). For example:
x2 %in% 0:7 ~ as.numeric(NA)
R has different types of NA. The one you are using is of logical type, but you need the double type NA_real_ in order to be consistent with the output of your other conditions. For more information, see this: https://stat.ethz.ch/R-manual/R-patched/library/base/html/NA.html
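A quick way to see the different NA types:
typeof(NA)            # "logical"
typeof(NA_real_)      # "double"
typeof(NA_integer_)   # "integer"
typeof(NA_character_) # "character"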
In base R, we can construct a logical vector and assign the column values to NA based on that logical vector. Unlike case_when, we don't have to really specify the type of NA as this gets automatically converted.
df1$d[df1$d %in% 0:7] <- NA
Also, a simple operation like this can be done compactly in base R, as in the sketch below.
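A minimal sketch (d1 here is a made-up example data frame, reusing the x2 %in% 0:7 condition from the question):
d1 <- data.frame(x2 = c(3, 10, 16, 25))
d1$x2[d1$x2 %in% 0:7] <- NA   # plain NA is fine; it is coerced to NA_real_ automatically
d1$x2
# [1] NA 10 16 25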
I'm trying to reformat my dataset from long to wide format, but while this is one of the most discussed topics, I couldn't find a solution for my case, nor to generalize from methods others have used.
My data is in long format, where each ID has a different number of rows than other IDs. I want to transform it to wide format, where each ID has one row and the data is spread across columns whose suffix reflects the order in which each value appears for that ID.
To illustrate:
Notice that the NA values don't necessarily correspond between the two formats. In the long format, NAs simply mark missing data; in the wide format, NAs appear where an ID has fewer values than the ID with the most values for that variable.
My Data
In real life, my data has more than one variable, and it could come in one of two versions:
Version 1 :: For each ID, values appear at the same row across variables
## reproducible data
set.seed(125)
runs_per_id <- sample(5:9, 4, replace = TRUE)
id <- rep(1:4, times = runs_per_id)
set.seed(300)
is_value <- sample (c(0, 1), size = length(id), replace = TRUE)
x <- is_value
x[which(as.logical(is_value))] <- sample(1:100, size = sum(x))
y <- is_value
y[which(as.logical(is_value))] <- sample(1:100, size = sum(y))
z <- is_value
z[which(as.logical(is_value))] <- sample(1:100, size = sum(z))
d <- as.data.frame(cbind(id, x, y, z))
d[d == 0] <- NA
d
# id x y z
# 1 1 38 63 61
# 2 1 17 27 76
# 3 1 32 81 89
# 4 1 NA NA NA
# 5 1 75 2 53
# 6 1 NA NA NA
# 7 2 NA NA NA
# 8 2 40 75 4
# 9 2 NA NA NA
# 10 2 NA NA NA
# 11 2 28 47 70
# 12 2 NA NA NA
# 13 2 71 67 33
# 14 3 NA NA NA
# 15 3 95 26 82
# 16 3 NA NA NA
# 17 3 41 7 99
# 18 3 97 8 68
# 19 4 NA NA NA
# 20 4 NA NA NA
# 21 4 93 38 58
# 22 4 NA NA NA
# 23 4 NA NA NA
Version 2 :: For each ID, values don't necessarily appear at the same row across variables
## reproducible data based on generating d from above
set.seed(12)
d2 <- data.frame(replicate(3, sample(0:1,length(id),rep=TRUE)))
d2[d2 != 0] <- sample(1:100, size = sum(d2 != 0))
d2[d2 == 0] <- NA
colnames(d2) <- c("x", "y", "z")
d2 <- as.data.frame(cbind(id, d2))
d2
## id x y z
## 1 1 18 28 5
## 2 1 85 93 22
## 3 1 55 59 NA
## 4 1 NA NA 67
## 5 1 NA 15 77
## 6 1 58 NA NA
## 7 2 NA 7 NA
## 8 2 NA NA 91
## 9 2 88 14 NA
## 10 2 13 NA NA
## 11 2 32 NA NA
## 12 2 NA 80 71
## 13 2 40 74 69
## 14 3 NA NA NA
## 15 3 96 NA 76
## 16 3 NA NA NA
## 17 3 73 66 NA
## 18 3 52 NA NA
## 19 4 56 12 16
## 20 4 53 NA NA
## 21 4 NA 42 84
## 22 4 39 99 NA
## 23 4 NA 37 NA
The Output I'm looking for
Version 1's data
Version 2's data
Trying to figure this out
I've used tidyr::spread() and even the new experimental pivot_wider() (inspired by this solution), but couldn't get it to number the occurrences of values within each variable so that the numbering appears in the column names.
Ideally, a single solution would address both data versions I presented. It basically just needs to be agnostic to the number of values each id has in each column, and let the data dictate... I think it's a simple problem, but I just can't wrap my head around this.
Thanks!!!
The following is a solution based on @A.Suliman's comment.
library(tidyr)
library(dplyr)
d %>%
# Combine all values besides id in one column
gather(key, value, -id) %>%
# Filter rows without a value
filter(!is.na(value)) %>%
group_by(id, key) %>%
# Create a new key variable numbering the key for each id
mutate(key_new = paste0(key, seq_len(n()))) %>%
ungroup() %>%
select(-key) %>%
# Spread the data with the new key
spread(key_new, value)
# A tibble: 4 x 13
# id x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 38 17 32 75 63 27 81 2 61 76 89 53
# 2 2 40 28 71 NA 75 47 67 NA 4 70 33 NA
# 3 3 95 41 97 NA 26 7 8 NA 82 99 68 NA
# 4 4 93 NA NA NA 38 NA NA NA 58 NA NA NA
For d2 instead of d it gives:
# A tibble: 4 x 13
# id x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 18 85 55 58 28 93 59 15 5 22 67 77
# 2 2 88 13 32 40 7 14 80 74 91 71 69 NA
# 3 3 96 73 52 NA 66 NA NA NA 76 NA NA NA
# 4 4 56 53 39 NA 12 42 99 37 16 84 NA NA
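If you prefer the newer tidyr verbs, roughly the same idea can be sketched with pivot_longer()/pivot_wider() (assuming tidyr >= 1.0; the column order may differ from the spread() version):
library(dplyr)
library(tidyr)
d %>%
  pivot_longer(-id, names_to = "key", values_drop_na = TRUE) %>%   # long format without NAs
  group_by(id, key) %>%
  mutate(key = paste0(key, row_number())) %>%                      # number occurrences per id
  ungroup() %>%
  pivot_wider(names_from = key, values_from = value)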
I have the following data:
x1 x2 x3 x4
34 14 45 53
2 8 18 17
34 14 45 20
19 78 21 48
2 8 18 5
In rows 1 and 3, and in rows 2 and 5, the values for columns x1, x2 and x3 are equal. How can I output only those 4 rows with matching values? The output should be in the following format:
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
Please ask me questions if something is unclear.
ADDITIONAL QUESTION: in the output
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
find the sum of values in last column:
x1 x2 x3 x4
34 14 45 73
2 8 18 22
You can do this with duplicated, which checks for duplicated rows when passed a matrix or data frame. Since you're only comparing the first three columns, pass dat[,-4] to the function.
dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast=T),]
# x1 x2 x3 x4
# 1 34 14 45 53
# 2 2 8 18 17
# 3 34 14 45 20
# 5 2 8 18 5
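For the additional question (summing x4 within each group of matching rows), one option is to aggregate the subset just produced; a sketch:
dups <- dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast=TRUE),]
aggregate(x4 ~ x1 + x2 + x3, data=dups, FUN=sum)
#   x1 x2 x3 x4
# 1  2  8 18 22
# 2 34 14 45 73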
An alternative using ave:
dat[ave(dat[,1], dat[-4], FUN=length) > 1,]
# x1 x2 x3 x4
#1 34 14 45 53
#2 2 8 18 17
#3 34 14 45 20
#5 2 8 18 5
Learned this one the other day. You won't need to re-order the output.
s <- split(dat, do.call(paste, dat[-4]))
Reduce(rbind, Filter(function(x) nrow(x) > 1, s))
# x1 x2 x3 x4
# 2 2 8 18 17
# 5 2 8 18 5
# 1 34 14 45 53
# 3 34 14 45 20
There is another way to solve both questions using two packages.
library(DescTools)
library(dplyr)
dat[AllDuplicated(dat[1:3]), ] %>%  # this line finds the duplicated rows
  group_by(x1, x2) %>%              # the following lines sum up x4
mutate(x4 = sum(x4)) %>%
unique()
# Source: local data frame [2 x 4]
# Groups: x1, x2
#
# x1 x2 x3 x4
# 1 34 14 45 73
# 2 2 8 18 22
You can also use the table command:
> d1 = ddf[ddf$x1 %in% ddf$x1[which(table(ddf$x1)>1)],]
> d2 = ddf[ddf$x2 %in% ddf$x2[which(table(ddf$x2)>1)],]
> rr = rbind(d1, d2)
> (rr2 = rr[!duplicated(rr),])
x1 x2 x3 x4
1 34 14 45 53
3 34 14 45 20
2 2 8 18 17
5 2 8 18 5
For sum in last column:
> library(data.table)
> rrt = data.table(rr2)
> rrt[,x4:=sum(x4),by=x1]
> rrt[rrt[,!duplicated(x1),]]
x1 x2 x3 x4
1: 34 14 45 73
2: 2 8 18 22
The first step is similar to the above; let z be your data.frame:
library(DescTools)
(zz <- Sort(z[AllDuplicated(z[, -4]), ], decreasing=TRUE) )
# now aggregate
aggregate(zz[, 4], zz[, -4], FUN=sum)
# use Sort again, if needed...
I have a dataframe of records of varying lengths, with NAs at the end. If there are more than three x-values in a record, I want to make the value of the third x-value equal to the value of the last x-value. Each record already tells me how many x-values it has.
I can make x3 be equal to the name of the last x-value (x4 or x5 etc) but what I really need is to make x3 take the value of that last x-value.
I'm sure there is some simple answer. Any help would be greatly appreciated! Thank you.
Here is a simple case:
ii <- "n x1 x2 x3 x4 x5 x6
1 3 30 40 20 NA NA NA
2 4 10 50 16 25 NA NA
3 6 20 15 26 16 18 28
4 5 10 10 18 17 19 NA
5 2 65 41 NA NA NA NA
6 5 10 11 23 16 23 NA
7 1 99 NA NA NA NA NA"
df <- read.table(text=ii, header = TRUE, na.strings="NA", colClasses="character")
oo <- "n x1 x2 x3
1 3 30 40 20
2 4 10 50 25
3 6 20 15 28
4 5 10 10 19
5 2 65 41 NA
6 5 10 11 23
7 1 99 NA NA"
desireddf <- read.table(text=oo, header = TRUE, na.strings="NA", colClasses="character")
df$lastx <- as.character(paste("x", df$n, sep=""))
#df$lastx <- df[[get(df$lastx)]] #How can I make lastx equal to the _value_ of lastx???
df[df$n>3, c('x3')] <- df[df$n>3, 'lastx']
df <- df[,1:4]
print(df)
yields the following, not the desireddf above.
n x1 x2 x3
1 3 30 40 20
2 4 10 50 x4
3 6 20 15 x6
4 5 10 10 x5
5 2 65 41 <NA>
6 5 10 11 x5
7 1 99 <NA> <NA>
This seems like a pretty arbitrary task, but here goes:
desireddf <- data.frame(n = df$n, x1 = df$x1, x2 = df$x2,
                        x3 = df[cbind(1:nrow(df), paste("x", pmax(3, as.numeric(df$n)), sep=""))])
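Broken out into steps, an equivalent sketch (assuming df as freshly read from the text above):
last_col <- paste0("x", pmax(3, as.numeric(df$n)))   # name of each row's last x column (at least x3)
df$x3 <- df[cbind(1:nrow(df), last_col)]             # matrix indexing with (row, column name) pairs
desireddf <- df[, c("n", "x1", "x2", "x3")]
desireddf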