I have the following data:
x1 x2 x3 x4
34 14 45 53
2 8 18 17
34 14 45 20
19 78 21 48
2 8 18 5
In rows 1 and 3, and in rows 2 and 5, the values in columns x1, x2, and x3 are equal. How can I output only those 4 rows with matching values? The output should be in the following format:
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
Please ask me questions if something is unclear.
ADDITIONAL QUESTION: in the output
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
find the sum of the values in the last column, giving:
x1 x2 x3 x4
34 14 45 73
2 8 18 22
You can do this with duplicated, which flags duplicated rows when given a data frame or matrix. Since you're only comparing the first three columns, pass dat[,-4] to the function.
dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast=TRUE),]
# x1 x2 x3 x4
# 1 34 14 45 53
# 2 2 8 18 17
# 3 34 14 45 20
# 5 2 8 18 5
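For the additional question (summing x4 within each group of matching rows), here is a minimal base-R follow-up; this is my own extension of the answer above, and dups is just a name for the filtered result:
dups <- dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast=TRUE),]  # same filter as above
# sum x4 within each (x1, x2, x3) combination
aggregate(x4 ~ x1 + x2 + x3, data = dups, FUN = sum)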
An alternative using ave:
dat[ave(dat[,1], dat[-4], FUN=length) > 1,]
# x1 x2 x3 x4
#1 34 14 45 53
#2 2 8 18 17
#3 34 14 45 20
#5 2 8 18 5
Learned this one the other day. You won't need to re-order the output.
s <- split(dat, do.call(paste, dat[-4]))
Reduce(rbind, Filter(function(x) nrow(x) > 1, s))
# x1 x2 x3 x4
# 2 2 8 18 17
# 5 2 8 18 5
# 1 34 14 45 53
# 3 34 14 45 20
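Continuing this split-based approach for the additional question (my own continuation, not part of the original answer), you can sum x4 within each kept group:
keep <- Filter(function(x) nrow(x) > 1, s)
# one row per group: the key columns from the first row plus the summed x4
do.call(rbind, lapply(keep, function(x) cbind(x[1, -4], x4 = sum(x$x4))))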
There is another way to solve both questions using two packages.
library(DescTools)
library(dplyr)
dat[AllDuplicated(dat[1:3]), ] %>%  # this line finds the duplicated rows
  group_by(x1, x2) %>%              # the following lines sum up x4 within each group
  mutate(x4 = sum(x4)) %>%
  unique()
# Source: local data frame [2 x 4]
# Groups: x1, x2
#
# x1 x2 x3 x4
# 1 34 14 45 73
# 2 2 8 18 22
You can also use the table command:
> d1 = ddf[ddf$x1 %in% ddf$x1[which(table(ddf$x1)>1)],]
> d2 = ddf[ddf$x2 %in% ddf$x2[which(table(ddf$x2)>1)],]
> rr = rbind(d1, d2)
> (rr2 = rr[!duplicated(rr),])
x1 x2 x3 x4
1 34 14 45 53
3 34 14 45 20
2 2 8 18 17
5 2 8 18 5
For the sum in the last column:
> library(data.table)
> rrt = data.table(rr2)
> rrt[, x4 := sum(x4), by = x1]
> rrt[!duplicated(x1)]
x1 x2 x3 x4
1: 34 14 45 73
2: 2 8 18 22
The first part is similar to the approach above; let z be your data.frame:
library(DescTools)
(zz <- Sort(z[AllDuplicated(z[, -4]), ], decreasing=TRUE) )
# now aggregate
aggregate(zz[, 4], zz[, -4], FUN=sum)
# use Sort again, if needed...
I have a dataframe of incident cases of a disease, by year and age, which looks like this (it is much larger than this example):
88 89 90 91
22 1 2 5 14
23 1 6 9 15
24 2 5 12 11
25 3 3 7 20
What I would like to do is cumulatively sum along the diagonals, to get this result:
88 89 90 91
22 1 2 5 14
23 1 7 11 20
24 2 6 19 22
25 3 5 13 39
Or, put another way, the original dataset is:
Y1 Y2 Y3 Y4
22 A1 B1 C1 D1
23 A2 B2 C2 D2
24 A3 B3 C3 D3
25 A4 B4 C4 D4
Final dataset:
Y1 Y2 Y3 Y4
22 A1 B1 C1 D1
23 A2 A1+B2 B1+C2 C1+D2
24 A3 A2+B3 A1+B2+C3 B1+C2+D3
25 A4 A3+B4 A2+B3+C4 A1+B2+C3+D4
Is there any way to do this in R?
I have seen the question How to sum over diagonals of a data frame, but that one only asks for the total sum; I want the cumulative sum along each diagonal.
Thanks.
Use ave, noting that row(m) - col(m) is constant along each diagonal:
ave(m, row(m) - col(m), FUN = cumsum)
## 88 89 90 91
## 22 1 2 5 14
## 23 1 7 11 20
## 24 2 6 19 22
## 25 3 5 13 39
It is assumed that m is a matrix as in the Note below. If you have a data frame then convert it to a matrix first.
Note
The input matrix m in reproducible form is:
Lines <- " 88 89 90 91
22 1 2 5 14
23 1 6 9 15
24 2 5 12 11
25 3 3 7 20"
m <- as.matrix(read.table(text = Lines, check.names = FALSE))
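If your data is in a data frame rather than a matrix (the question's situation), a minimal sketch of the conversion round trip; dd is a hypothetical name for that data frame:
mm <- as.matrix(dd)                             # convert the data frame to a matrix first
res <- ave(mm, row(mm) - col(mm), FUN = cumsum) # cumulative sums along each diagonal
as.data.frame(res)                              # back to a data frame, if needed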
I have a dataframe of records of varying lengths, with NAs at the end. If there are more than three x-values in a record, I want to make the value of the third x-value equal to the value of the last x-value. Each record already tells me how many x-values it has.
I can make x3 be equal to the name of the last x-value (x4 or x5 etc) but what I really need is to make x3 take the value of that last x-value.
I'm sure there is some simple answer. Any help would be greatly appreciated! Thank you.
Here is a simple case:
ii <- "n x1 x2 x3 x4 x5 x6
1 3 30 40 20 NA NA NA
2 4 10 50 16 25 NA NA
3 6 20 15 26 16 18 28
4 5 10 10 18 17 19 NA
5 2 65 41 NA NA NA NA
6 5 10 11 23 16 23 NA
7 1 99 NA NA NA NA NA"
df <- read.table(text=ii, header = TRUE, na.strings="NA", colClasses="character")
oo <- "n x1 x2 x3
1 3 30 40 20
2 4 10 50 25
3 6 20 15 28
4 5 10 10 19
5 2 65 41 NA
6 5 10 11 23
7 1 99 NA NA"
desireddf <- read.table(text=oo, header = TRUE, na.strings="NA", colClasses="character")
df$lastx <- as.character(paste("x", df$n, sep=""))
#df$lastx <- df[[get(df$lastx)]] #How can I make lastx equal to the _value_ of lastx???
df[df$n>3, c('x3')] <- df[df$n>3, 'lastx']
df <- df[,1:4]
print(df)
This yields the following, not the desireddf shown above:
n x1 x2 x3
1 3 30 40 20
2 4 10 50 x4
3 6 20 15 x6
4 5 10 10 x5
5 2 65 41 <NA>
6 5 10 11 x5
7 1 99 <NA> <NA>
This seems like a pretty arbitrary task, but here goes:
desireddf <- data.frame(n = df$n, x1 = df$x1, x2 = df$x2,
                        x3 = df[cbind(1:nrow(df), paste("x", pmax(3, as.numeric(df$n)), sep = ""))])
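The work is done by matrix indexing: df[cbind(i, j)] extracts one element per row of the index matrix, and j may be a column name. A small sketch of just that step; last_col is an illustrative name of my own:
# which column holds the last x-value for each record: x3 if n <= 3, otherwise x<n>
last_col <- paste("x", pmax(3, as.numeric(df$n)), sep = "")
# one element per row: row i paired with column last_col[i]
df[cbind(seq_len(nrow(df)), last_col)]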
I have a dataframe of 4 columns and a few thousand rows. I am ordering the dataframe by its 4th column, which is the ID (descending), and then by the second column (ascending). Here's what my data looks like:
X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
24 19 16 23
19 20 7 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
16 20 6 21
I am trying to keep the top 4 rows for each ID (where available); my desired output would be:
X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
# note that two of the ID-23 rows and one of the ID-21 rows were removed.
I was wondering if there is some short command that can do the job for me. I can think of a command that is around one page long, which subsets the data by the 4th column, takes the top rows of each subset, and then rbinds them back together. But that sounds so unprofessional!
Here's a command to generate a similar example:
m0 <- matrix(0, 100, 4)
df <- data.frame(apply(m0, c(1,2), function(x) sample(c(0:25),1)))
##fix(df)
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
Thanks all.
maybe data.table:
require(data.table)
df<-read.table(header=T,text=" X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
24 19 16 23
19 20 7 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
16 20 6 21")
data.table(df)[,.SD[order(X2)][1:4,],by="X4"][!is.na(X3)][,list(X1,X2,X3,X4)]
X1 X2 X3 X4
1: 24 1 23 25
2: 21 3 19 25
3: 19 6 20 25
4: 11 12 14 25
5: 14 9 21 24
6: 3 12 25 24
7: 24 15 23 24
8: 8 1 4 23
9: 17 4 12 23
10: 16 11 23 23
11: 20 19 21 23
12: 19 22 22 22
13: 11 2 18 21
14: 15 9 19 21
15: 10 14 9 21
16: 17 15 19 21
here's what's happening in the data.table call:
data.table(df)[ # data.table of df
,.SD[ # for each by=X4, .SD is the sub-table
order(X2)][1:4,], # first four entries ordered by X2
by="X4"][ # X4 is the grouping variable
!is.na(X3)][ # drop the NA padding (groups with fewer than 4 rows)
,list(X1,X2,X3,X4)] # put the columns back in the original order
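As a small aside (my own variant, not from the answer above): wrapping the ordered sub-table in head takes at most four rows per group, so there is no NA padding and the !is.na(X3) filter is no longer needed:
data.table(df)[, head(.SD[order(X2)], 4), by = "X4"][, list(X1, X2, X3, X4)]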
I think that Thomas's solution is fine, but can be improved. I would guess that the splitting, recombining, and reordering might be time consuming.
Instead, I would create a vector from which we can subset.
This is easily done with ave and should work since the data are already ordered.
Continuing from:
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
we can do:
out <- odf[ave(odf$X4, odf$X4, FUN = seq_along) <= 4, ]
head(out)
# X1 X2 X3 X4
# 24 3 4 13 25
# 6 23 5 13 25
# 19 9 11 24 25
# 40 10 13 11 25
# 93 16 2 25 24
# 26 10 11 13 24
tail(out)
# X1 X2 X3 X4
# 61 23 7 13 2
# 2 9 9 5 2
# 17 18 18 16 2
# 67 12 1 1 1
# 52 22 14 24 1
# 9 16 24 6 1
Update: New alternatives and benchmarks
The "dplyr" package would be great for this, and the syntax is pretty compact. But first, let's set some things up to see how fast these options are:
Functions to benchmark
fun1 <- function() {
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
out <- do.call(rbind, lapply(split(odf, odf$X4), function(z) head(z[order(z$X2),],4) ))
out[order(out$X4, decreasing=TRUE),]
}
fun2 <- function() {
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
odf[ave(odf$X4, odf$X4, FUN = seq_along) <= 4, ]
}
fun3 <- function() {
DT <- data.table(df)
DT[, X := -X4]
setkey(DT, X, X2)
DT[, .SD[sequence(min(.N, 4))], by = X][, X:=NULL][]
}
fun4 <- function() {
# note: `%.%` was dplyr's chaining operator at the time; current dplyr uses `%>%`
group_by(arrange(df, desc(X4), X2), X4) %.%
mutate(vals = seq_along(X4)) %.%
filter(vals <= 4)
}
A bigger version of your sample data
set.seed(1)
df <- data.frame(matrix(sample(0:1000, 1000000 * 4, replace = TRUE), ncol = 4))
The necessary packages
library(data.table)
library(dplyr)
library(microbenchmark)
The first two approaches (Thomas's and my first approach) take a fair amount of time, so instead of benchmarking, I'll just time them once.
system.time(fun1())
# user system elapsed
# 6.645 0.007 6.670
system.time(fun2())
# user system elapsed
# 4.053 0.004 4.186
Here's the "dplyr" and "data.table" results.
microbenchmark(fun3(), fun4(), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# fun3() 2.157956 2.221746 2.303286 2.343951 2.392391 20
# fun4() 1.169212 1.180780 1.194994 1.206651 1.369922 20
Compare the output of the "dplyr" and "data.table" approaches:
out_DT <- fun3()
out_DP <- fun4()
out_DT
# X1 X2 X3 X4
# 1: 340 0 708 1000
# 2: 144 1 667 1000
# 3: 73 2 142 1000
# 4: 79 2 826 1000
# 5: 169 0 870 999
# ---
# 4000: 46 4 2 1
# 4001: 88 0 809 0
# 4002: 535 0 522 0
# 4003: 75 3 234 0
# 4004: 983 3 492 0
head(out_DP, 5)
# Source: local data frame [5 x 5]
# Groups: X4
#
# X1 X2 X3 X4 vals
# 1 340 0 708 1000 1
# 2 144 1 667 1000 2
# 3 73 2 142 1000 3
# 4 79 2 826 1000 4
# 5 169 0 870 999 1
tail(out_DP, 5)
# Source: local data frame [5 x 5]
# Groups: X4
#
# X1 X2 X3 X4 vals
# 4000 46 4 2 1 4
# 4001 88 0 809 0 1
# 4002 535 0 522 0 2
# 4003 75 3 234 0 3
# 4004 983 3 492 0 4
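For readers on current dplyr, where %.% has been removed, a rough modern equivalent of fun4 (my addition; it was not part of the benchmarks above):
library(dplyr)
df %>%
  arrange(desc(X4), X2) %>%   # ID descending, then X2 ascending
  group_by(X4) %>%
  slice_head(n = 4) %>%       # keep at most the first 4 rows per ID
  ungroup()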
I include your code again with a set.seed call, so that this is exactly reproducible.
set.seed(1)
m0 <- matrix(0, 100, 4)
df <- data.frame(apply(m0, c(1,2), function(x) sample(c(0:25),1)))
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
Here's the code you need using a split-apply-combine strategy:
out <- do.call(rbind, lapply(split(odf, odf$X4), function(z) head(z[order(z$X2),],4) ))
out <- out[order(out$X4, decreasing=TRUE),]
Result:
> dim(out)
[1] 79 4
> head(out)
X1 X2 X3 X4
25.24 3 4 13 25
25.6 23 5 13 25
25.19 9 11 24 25
25.40 10 13 11 25
24.93 16 2 25 24
24.26 10 11 13 24
I have a dataframe in R that I want to randomize, keeping the first column as it is but randomizing the last two columns together, so that values that appear in the same row in those columns still appear in the same row after randomizing. So if I started with this:
1 a b c
2 d e f
3 g h i
when randomized it might look like:
1 a e f
2 d h i
3 g b c
I know that sample works fine, but does it preserve the correspondence between the columns?
> t <- data.frame(matrix(nrow=4,ncol=10,data=1:40))
> t
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 5 9 13 17 21 25 29 33 37
2 2 6 10 14 18 22 26 30 34 38
3 3 7 11 15 19 23 27 31 35 39
4 4 8 12 16 20 24 28 32 36 40
> columns_to_random <- c(8,9,10)
> t[,columns_to_random] <- t[sample(1:nrow(t),size=nrow(t)), columns_to_random]
> t
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 5 9 13 17 21 25 32 36 40
2 2 6 10 14 18 22 26 29 33 37
3 3 7 11 15 19 23 27 30 34 38
4 4 8 12 16 20 24 28 31 35 39
Just sample one column at a time and you'll be fine. For example:
data[,2] = sample(data[,2])
data[,3] = sample(data[,3])
...
If you have many columns, you can extend this like so:
data[,-1] = apply(data[,-1], 2, sample)
EDIT: With your clarification about row equivalence, this is just:
data[,-1] = data[sample(nrow(data)),-1]
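As a quick check of that last line, here is a toy example mirroring the question's data; the column names c1 to c3 are made up for illustration:
d <- data.frame(c1 = c("a","d","g"), c2 = c("b","e","h"), c3 = c("c","f","i"))
set.seed(42)                       # for reproducibility
d[,-1] <- d[sample(nrow(d)), -1]   # rows of the last two columns move together
d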
What do you mean by "values equivalence"?
Honestly, I don't quite follow the question, but here's my guess. As you said, you could use sample, but apply it separately to your columns, e.g. via apply:
# create a reproducible example
test <- data.frame(indx=c(1,2,3),col1=c("a","d","g"),
col2=c("b","e","h"),col3=c("c","f","i"))
xyz <- apply(test[,-1],MARGIN=2,sample)
as.data.frame(xyz)
An approach using colwise from plyr for elegant column-wise permutation:
test <- data.frame(matrix(nrow=4,ncol=10,data=1:40))
Load plyr
require(plyr)
Create a column-wise "sample" function
colwise.sample <- colwise(sample)
Apply to the desired rows
permutation.test <- test
permutation.test[,c(1,3,4)] <- colwise.sample(test[,c(1,3,4)])