I have a matrix with a large number of duplicate rows and would like to obtain a matrix with the unique rows and a frequency count for each unique row.
The example shown below solves this problem but is painfully slow.
rowsInTbl <- function(tbl, row){
  sum(apply(tbl, 1, function(x) all(x == row)))
}

colFrequency <- function(tblall){
  tbl <- unique(tblall)
  results <- matrix(nrow = nrow(tbl), ncol = ncol(tbl) + 1)
  results[, 1:ncol(tbl)] <- as.matrix(tbl)
  dimnames(results) <- list(rownames(tbl), c(colnames(tbl), "Frequency"))
  freq <- apply(tbl, 1, function(x) rowsInTbl(tblall, x))
  results[, "Frequency"] <- freq
  return(results)
}
m <- matrix(c(1,2,3,4,3,4,1,2,3,4),ncol=2,byrow=T)
dimnames(m) <- list(letters[1:nrow(m)],c("c1","c2"))
print("Matrix")
print(m)
[1] "Matrix"
c1 c2
a 1 2
b 3 4
c 3 4
d 1 2
e 3 4
print("Duplicate frequency table")
print(colFrequency(m))
[1] "Duplicate frequency table"
c1 c2 Frequency
a 1 2 2
b 3 4 3
Here are the speed measurements of the answers of @Heroka and @m0h3n compared to my example. The matrix shown above was repeated 1000 times. data.table is clearly the fastest solution.
[1] "Duplicate frequency table - my example"
user system elapsed
0.372 0.000 0.371
[1] "Duplicate frequency table - data.table"
user system elapsed
0.008 0.000 0.008
[1] "Duplicate frequency table - aggregate"
user system elapsed
0.092 0.000 0.089
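For reference, a minimal sketch of how these timings can be reproduced (the benchmark code itself wasn't shown; the 1000-fold row repetition and the system.time() wrappers are my assumptions):
library(data.table)
big <- m[rep(seq_len(nrow(m)), 1000), ]   # the example matrix repeated 1000 times
rownames(big) <- NULL                     # avoid duplicated row names downstream

system.time(colFrequency(big))                                           # my example

bigdt <- as.data.table(big)
system.time(bigdt[, .N, by = names(bigdt)])                              # data.table

bigdf <- as.data.frame(big)
system.time(aggregate(bigdf, by = bigdf, length)[1:(ncol(bigdf) + 1)])   # aggregate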
Looks like a job for data.table, as you need something that can aggregate quickly.
library(data.table)
m <- matrix(c(1,2,3,4,3,4,1,2,3,4),ncol=2,byrow=T)
mdt <- as.data.table(m)
res <- mdt[,.N, by=names(mdt)]
res
# > res
# V1 V2 N
# 1: 1 2 2
# 2: 3 4 3
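If you want the result shaped like the frequency table in the question, a possible follow-up (the renaming to c1/c2/Frequency is my addition, assuming the matrix carried those column names):
setnames(res, c("V1", "V2", "N"), c("c1", "c2", "Frequency"))
as.matrix(res)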
How about this using base R for extracting unique rows:
mat <- matrix(c(2,5,3,5,2,3,4,2,3,5,4,2,1,5,3,5), ncol = 2, byrow = T)
mat[!duplicated(mat),]
# [,1] [,2]
# [1,] 2 5
# [2,] 3 5
# [3,] 2 3
# [4,] 4 2
# [5,] 1 5
Extracting unique rows along with their frequencies (aggregate() groups by every column, length() counts each group, and we keep only the first count column):
m <- as.data.frame(mat)
aggregate(m, by=m, length)[1:(ncol(m)+1)]
# V1 V2 V1.1
# 1 4 2 2
# 2 2 3 1
# 3 1 5 1
# 4 2 5 1
# 5 3 5 3
Related
Is there a way in R to distribute n among k units without repetition (e.g., 3 5 2 is the same as 5 3 2, 2 3 5, and 5 2 3) and without zero combinations (i.e., no 9 0 0), and to see the make-up of this distribution?
For example if n = 9 and k = 3 then we expect the make-up to be:
(Note: k will always be the # of columns)
3 3 3
4 3 2
4 4 1
5 2 2
5 3 1
6 2 1
7 1 1
makeup <- function(n, k){
# your suggested solution #
}
These are called integer partitions (more specifically, restricted integer partitions) and can be generated efficiently with the packages partitions or arrangements, like so:
partitions::restrictedparts(9, 3, include.zero = FALSE)
[1,] 7 6 5 4 5 4 3
[2,] 1 2 3 4 2 3 3
[3,] 1 1 1 1 2 2 3
arrangements::partitions(9, 3)
[,1] [,2] [,3]
[1,] 1 1 7
[2,] 1 2 6
[3,] 1 3 5
[4,] 1 4 4
[5,] 2 2 5
[6,] 2 3 4
[7,] 3 3 3
They are much faster than the solutions provided thus far:
library(microbenchmark)
microbenchmark(arrangePack = arrangements::partitions(20, 5),
               partsPack = partitions::restrictedparts(20, 5, include.zero = FALSE),
               myfun2(20, 5, 20),
               myfun1(20, 5, 20),
               makeup(20, 5),
               mycomb(20, 5), times = 3, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
arrangePack 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
partsPack 3.070203 2.755573 2.084231 2.553477 1.854912 1.458389 3
myfun2(20, 5, 20) 10005.679667 8528.784033 6636.284386 7580.133387 5852.625112 4872.050067 3
myfun1(20, 5, 20) 12770.400243 10574.957696 8005.844282 9164.764625 6897.696334 5610.854109 3
makeup(20, 5) 15422.745155 12560.083171 9248.916738 10721.316721 7812.997976 6162.166646 3
mycomb(20, 5) 1854.125325 1507.150003 1120.616461 1284.278219 950.015812 760.280469 3
In fact, for the example below, the other functions will error out because of memory:
system.time(arrangements::partitions(100, 10))
user system elapsed
0.068 0.031 0.099
arrangements::npartitions(100, 10)
[1] 2977866
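If you prefer each row in the descending layout used in the question, a small sketch of my own: simply reverse the columns.
parts <- arrangements::partitions(9, 3)
parts[, ncol(parts):1]   # e.g. row 1 1 7 becomes 7 1 1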
You may try gtools::combinations for this, as below, with the repeats.allowed = TRUE option:
m <- gtools::combinations(9, 3, repeats.allowed = TRUE)
m[rowSums(m) == 9,]
A possible function follows. With options(expressions = 500000), it can go up to n = 500 (it ran successfully on my machine for n = 500, r = 3):
mycomb <- function(n, r, sumval){
  m <- gtools::combinations(n, r, repeats.allowed = TRUE)
  m[rowSums(m) == sumval, ]
}
mycomb(9,3,9)
Output:
# [,1] [,2] [,3]
#[1,] 1 1 7
#[2,] 1 2 6
#[3,] 1 3 5
#[4,] 1 4 4
#[5,] 2 2 5
#[6,] 2 3 4
#[7,] 3 3 3
Here's a base solution using expand.grid. I'm not going to recommend it for large n, but it works:
makeup <- function(n, k) {
  x <- expand.grid(rep(list(1:n), k))       # generate all combinations of k values
  x <- x[rowSums(x) == n, ]                 # keep only rows that sum to n
  x <- as.data.frame(t(apply(x, 1, sort)))  # sort within each row
  unique(x)                                 # keep non-duplicates
}
A little rethinking simplifies this greatly. If we lay out n objects in a row, there are n-1 gaps between them; choosing k-1 of those gaps as split points breaks the row into k non-empty bins (the classic stars-and-bars argument). Starting from this, we can reduce the work substantially:
makeup <- function(n, k) {
  splits <- combn(n - 1, k - 1)                      # gap positions at which to split the data
  bins <- rbind(rep(0, ncol(splits)), splits)        # add an extra "split" before the 1st element
  x <- apply(bins, 2, function(x) c(x[-1], n) - x)   # count how many items land in each bin
  x <- as.data.frame(t(apply(x, 2, sort)))           # sort within each row
  unique(x)                                          # keep non-duplicates
}
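A quick check against the expected make-up listed in the question:
makeup(9, 3)   # returns the seven unique partitions of 9 into 3 positive parts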
Using a matrix in base R:
myfun1 <- function(n, k){
  x <- as.matrix(expand.grid(rep(list(seq_len(n)), k)))
  x <- x[rowSums(x) == n, ]
  x[!duplicated(t(apply(x, 1, sort))), ]
}
myfun1( n = 9, k = 3 )
Or maybe this, using data.table:
myfun2 <- function(n, k){
  require('data.table')
  dt <- do.call(CJ, rep(list(seq_len(n)), k))
  dt <- dt[rowSums(dt) == n, ]
  dt[which(!duplicated(dt[, transpose(lapply(transpose(.SD), sort))])), ]
}
myfun2( n = 9, k = 3 )
# V1 V2 V3
# 1: 7 1 1
# 2: 6 2 1
# 3: 5 3 1
# 4: 4 4 1
# 5: 5 2 2
# 6: 4 3 2
# 7: 3 3 3
I have a 2D matrix mat with 500 rows × 335 columns, and a data.frame dat with 120425 rows. The data.frame dat has two integer columns, I and J, which index a row and column of mat. I would like to add the corresponding values from mat to the rows of dat.
Here is my conceptual fail:
> dat$matval <- mat[dat$I, dat$J]
Error: cannot allocate vector of length 1617278737
(I am using R 2.13.1 on Win32). Digging a bit deeper, I see that I'm misusing matrix indexing, as it appears that I'm only getting a sub-matrix of mat, and not a single-dimension array of values as I expected, i.e.:
> str(mat[dat$I[1:100], dat$J[1:100]])
int [1:100, 1:100] 20 1 1 1 20 1 1 1 1 1 ...
I was expecting something like int [1:100] 20 1 1 1 20 1 1 1 1 1 .... What is the correct way to index a 2D matrix using indices of row, column to get the values?
Almost. The indices need to be offered to "[" as a two-column matrix:
dat$matval <- mat[ cbind(dat$I, dat$J) ] # should do it.
There is a caveat: although this also works for data frames, they are first coerced to matrix class, and if any column is non-numeric, the entire matrix becomes the "lowest common denominator" class.
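A small hypothetical illustration of that caveat:
idx <- data.frame(I = c(1, 2), J = c("3", "4"))   # one character column
str(as.matrix(idx))   # both columns are coerced to character, so numeric indexing is lost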
Using a matrix to index as DWin suggests is of course much cleaner, but for some strange reason doing it manually using 1-D indices is actually slightly faster:
# Huge sample data
mat <- matrix(sin(1:1e7), ncol=1000)
dat <- data.frame(I=sample.int(nrow(mat), 1e7, rep=T),
                  J=sample.int(ncol(mat), 1e7, rep=T))
system.time( x <- mat[cbind(dat$I, dat$J)] ) # 0.51 seconds
system.time( mat[dat$I + (dat$J-1L)*nrow(mat)] ) # 0.44 seconds
The dat$I + (dat$J-1L)*nrow(mat) part turns the 2-D indices into 1-D ones. The 1L is the way to specify an integer instead of a double value. This avoids some coercions.
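A quick sanity check (a toy example of my own) that the two indexing forms agree:
m2 <- matrix(1:12, nrow = 3)
i <- c(1, 3, 2); j <- c(4, 1, 2)
identical(m2[cbind(i, j)], m2[i + (j - 1L) * nrow(m2)])   # TRUE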
...I also tried gsk3's apply-based solution. It's almost 500x slower though:
system.time( apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat ) ) # 212
Here's a one-liner using apply's row-based operations:
> dat <- as.data.frame(matrix(rep(seq(4),4),ncol=2))
> colnames(dat) <- c('I','J')
> dat
I J
1 1 1
2 2 2
3 3 3
4 4 4
5 1 1
6 2 2
7 3 3
8 4 4
> mat <- matrix(seq(16),ncol=4)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> dat$K <- apply( dat, 1, function(x,mat) mat[ x[1], x[2] ], mat=mat )
> dat
I J K
1 1 1 1
2 2 2 6
3 3 3 11
4 4 4 16
5 1 1 1
6 2 2 6
7 3 3 11
8 4 4 16
Build the upper-triangle index pairs of a correlation matrix by hand, then pull out the corresponding values with two-column matrix indexing:
n <- 10
mat <- cor(matrix(rnorm(n*n), n, n))
ix <- matrix(NA, n*(n-1)/2, 2)   # one row per (i, j) pair with i < j
k <- 0
for (i in 1:(n-1)){
  for (j in (i+1):n){
    k <- k + 1
    ix[k, 1] <- i
    ix[k, 2] <- j
  }
}
o <- mat[ix]            # two-column matrix indexing returns a plain vector
out <- cbind(ix, o)
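For reference, the same upper-triangle index pairs can be built without loops (a sketch; note that upper.tri orders the pairs column-major rather than row-major):
ix2 <- which(upper.tri(mat), arr.ind = TRUE)   # all (row, col) pairs with row < col
out2 <- cbind(ix2, o = mat[ix2])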
I need to find the row-wise minimum of many (60+) relatively large data.frames (~250,000 x 3); equivalently, I can work on an xts.
set.seed(1000)
my.df <- sample(1:5, 250000*3, replace=TRUE)
dim(my.df) <- c(250000,3)
my.df <- as.data.frame(my.df)
names(my.df) <- c("A", "B", "C")
The data frame my.df looks like this
> head(my.df)
A B C
1 2 5 2
2 4 5 5
3 1 5 3
4 4 4 3
5 3 5 5
6 1 5 3
I tried
require(data.table)
my.dt <- as.data.table(my.df)
my.dt[, row.min:=0] # without this: "Attempt to add new column(s) and set subset of rows at the same time"
system.time(
  for (i in 1:dim(my.dt)[1]) my.dt[i, row.min := min(A, B, C)]
)
On my system this takes ~400 seconds. It works, but I am not confident it is the best way to use data.table.
Am I using data.table correctly? Is there a more efficient way to do simple row-wise operations?
Or, just pmin.
my.dt <- as.data.table(my.df)
system.time(my.dt[,row.min:=pmin(A,B,C)])
# user system elapsed
# 0.02 0.00 0.01
head(my.dt)
# A B C row.min
# [1,] 2 5 2 2
# [2,] 4 5 5 4
# [3,] 1 5 3 1
# [4,] 4 4 3 3
# [5,] 3 5 5 3
# [6,] 1 5 3 1
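If there are many columns, listing them all in pmin() gets unwieldy; the idiom used in the benchmark further down generalizes it via .SD:
my.dt <- as.data.table(my.df)            # fresh copy, so .SD holds only A, B and C
my.dt[, row.min := do.call(pmin, .SD)]   # pmin across every column of the table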
After some discussion around row-wise first/last occurrences from column series in data.table, which suggested that melting first would be faster than a row-wise calculation, I decided to benchmark:
pmin (Matt Dowle's answer above), below as tm1
apply (Andrie's answer above), below as tm2
melting first, then min by group, below as tm3
so:
library(microbenchmark); library(data.table)
set.seed(1000)
b <- data.table(m=integer(), n=integer(), tm1=numeric(), tm2=numeric(), tm3=numeric())
for (m in c(2.5, 100)*1e5){
  for (n in c(3, 50)){
    my.df <- sample(1:5, m*n, replace=TRUE)
    dim(my.df) <- c(m, n)
    my.df <- as.data.frame(my.df)
    names(my.df) <- c(LETTERS, letters)[1:n]
    my.dt <- as.data.table(my.df)
    tm1 <- mean(microbenchmark(my.dt[, foo := do.call(pmin, .SD)], times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm2 <- mean(microbenchmark(apply(my.dt, 1, min), times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm3 <- mean(microbenchmark(
      melt(my.dt[, id := 1:nrow(my.dt)], id.vars='id')[, min(value), by=id],
      times=30L
    )$time)/1e6
    b <- rbind(b, data.table(m, n, tm1, tm2, tm3))
  }
}
(I ran out of time to try more combinations.) This gives us:
b
# m n tm1 tm2 tm3
# 1: 2.5e+05 3 16.20598 1000.345 39.36171
# 2: 2.5e+05 50 166.60470 1452.239 588.49519
# 3: 1.0e+07 3 662.60692 31122.386 1668.83134
# 4: 1.0e+07 50 6594.63368 50915.079 17098.96169
c <- melt(b, id.vars=c('m','n'))
library(ggplot2)
ggplot(c, aes(x=m, linetype=as.factor(n), col=variable, y=value)) + geom_line() +
ylab('Runtime (millisec)') + xlab('# of rows') +
guides(linetype=guide_legend(title='Number of columns'))
Although I knew apply (tm2) would scale poorly, I am surprised that pmin (tm1) scales so well, given that R is not really designed for row-wise operations. I couldn't identify a case where pmin shouldn't be preferred over melt-min-by-group (tm3).
The classical way of doing row-wise operations in R is to use apply:
my.df$min <- apply(my.df, 1, min)
> head(my.df)
A B C min
1 2 5 4 2
2 4 3 1 1
3 1 1 5 1
4 4 1 5 1
5 3 3 4 3
6 1 1 1 1
On my machine, this operation takes about 0.25 seconds.
I want to column-bind (cbind) mydf[,"c"] to mydf and give it the new name newcolumn in one step, obtaining the result matrix mydf shown second below. How do I do it?
mydf
# a b c
# 1 2 6
# 1 3 4
The desired result:
mydf
# a b c newcolumn
# 1 2 6 6
# 1 3 4 4
You can specify the new column name in the call to cbind:
mydf <- cbind(mydf, newcolumn=mydf[,"c"])
mydf
# a b c newcolumn
# [1,] 1 2 6 6
# [2,] 1 3 4 4
Data (constructed with the same approach):
mydf <- cbind(a=c(1, 1), b=c(2, 3), c=c(6, 4))
If you had a data frame instead of a matrix, you could simply do mydf$newcolumn <- mydf$c.
There are many approaches you could take here:
mydf <- data.frame(a=c(1,1),b=c(2,3),c=c(6,4));
mydf;
## a b c
## 1 1 2 6
## 2 1 3 4
data.frame(mydf,newcolumn=mydf$c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
cbind(mydf,newcolumn=mydf$c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
transform(mydf,newcolumn=c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
within(mydf,newcolumn <- c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
mydf$newcolumn <- mydf$c;
mydf;
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
Of the 5 approaches shown above, only the last actually modifies mydf. For the other 4, you have to assign the return value back to mydf to replace it with the new data.frame that has the additional column.
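For example, to keep the result of one of the non-modifying calls (reusing the transform() call from above):
mydf <- transform(mydf,newcolumn=c);   # reassign to keep the new column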
library('microbenchmark');
bind.df <- function() mydf <- data.frame(mydf,newcolumn=mydf$c);
bind.cb <- function() mydf <- cbind(mydf,newcolumn=mydf$c);
bind.tr <- function() mydf <- transform(mydf,newcolumn=c);
bind.wi <- function() mydf <- within(mydf,newcolumn <- c);
bind.as1 <- function() mydf$newcolumn <- mydf$c;
bind.as2 <- function() mydf['newcolumn'] <- mydf['c'];
bind.as3 <- function() mydf[,'newcolumn'] <- mydf[,'c'];
bind.as4 <- function() mydf[['newcolumn']] <- mydf[['c']];
N <- 1e5; mydf <- data.frame(a=rep(c(1,1),N),b=rep(c(2,3),N),c=rep(c(6,4),N));
microbenchmark(bind.df(),bind.cb(),bind.tr(),bind.as1(),bind.as2(),bind.as3(),bind.as4(),times=1e4);
## Unit: microseconds
## expr min lq mean median uq max neval
## bind.df() 97.077 112.046 128.66080 121.027 134.711 1690.513 10000
## bind.cb() 86.814 100.927 117.14364 109.907 122.737 1849.172 10000
## bind.tr() 105.203 120.171 138.90802 131.290 145.830 1680.250 10000
## bind.as1() 12.402 20.100 23.35085 22.239 25.660 148.397 10000
## bind.as2() 370.776 412.686 596.47901 425.088 449.036 41799.239 10000
## bind.as3() 347.682 385.743 564.78320 396.435 419.528 42144.355 10000
## bind.as4() 17.534 26.087 30.09639 28.654 32.930 638.915 10000
If you have two sets of columns and would like to bind them together column by column, convert each to a data.frame and use cbind:
dapu <- cbind(data.frame(data_r), data.frame(data_c))
I have a data frame with a list of X/Y locations (>2000 rows). I want to select or find all the rows/locations that fall within a maximum distance of each other. For example, from the data frame, select all the locations that are between 1 and 100 km from each other. Any suggestions on how to do this?
You need to somehow determine the distance between each pair of rows.
The simplest way is with a corresponding distance matrix
# assuming thresh is your threshold
thresh <- 10

# create some sample data
library(data.table)
set.seed(123)
DT <- data.table(X=sample(-10:10, 5, TRUE), Y=sample(-10:10, 5, TRUE))

# create the distance matrix
distTable <- matrix(apply(createTable(DT), 1, distance), nrow=nrow(DT))

# remove the lower triangle, since we have symmetry (we don't want duplicates)
distTable[lower.tri(distTable)] <- NA

# show which row pairs are at or above the threshold
pairedRows <- which(distTable >= thresh, arr.ind=TRUE)
colnames(pairedRows) <- c("RowA", "RowB")  # clean up the names
Starting with:
> DT
X Y
1: -4 -10
2: 6 1
3: -2 8
4: 8 1
5: 9 -1
We get:
> pairedRows
RowA RowB
[1,] 1 2
[2,] 1 3
[3,] 2 3
[4,] 1 4
[5,] 3 4
[6,] 1 5
[7,] 3 5
These are the two functions used for creating the distance matrix:
# pair up all of the rows
createTable <- function(DT)
  expand.grid(apply(DT, 1, list), apply(DT, 1, list))

# simple Cartesian/Pythagorean (Euclidean) distance
distance <- function(CoordPair)
  sqrt(sum((CoordPair[[2]][[1]] - CoordPair[[1]][[1]])^2, na.rm=FALSE))
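For reference, base R's dist() computes the same Euclidean distances directly and sidesteps the expand.grid pairing (a sketch on the same DT; it should yield the same row pairs):
distTable2 <- as.matrix(dist(DT))                    # symmetric Euclidean distance matrix
distTable2[lower.tri(distTable2, diag = TRUE)] <- NA
which(distTable2 >= thresh, arr.ind = TRUE)          # row pairs at or above the threshold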
I'm not entirely clear from your question, but assuming you mean you want to take each row of coordinates and find all the other rows whose coordinates fall within a certain distance:
# Create data set for example
set.seed(42)
x <- sample(-100:100, 10)
set.seed(456)
y <- sample(-100:100, 10)
coords <- data.frame(
"x" = x,
"y" = y)
# Loop through all rows
lapply(1:nrow(coords), function(i) {
dis <- sqrt(
(coords[i,"x"] - coords[, "x"])^2 + # insert your preferred
(coords[i,"y"] - coords[, "y"])^2 # distance calculation here
)
names(dis) <- 1:nrow(coords) # replace this part with an index or
# row names if you have them
dis[dis > 0 & dis <= 100] # change numbers to preferred threshold
})
[[1]]
2 6 7 9 10
25.31798 95.01579 40.01250 30.87070 73.75636
[[2]]
1 6 7 9 10
25.317978 89.022469 51.107729 9.486833 60.539243
[[3]]
5 6 8
70.71068 91.78780 94.86833
[[4]]
5 10
40.16217 99.32774
[[5]]
3 4 6 10
70.71068 40.16217 93.40771 82.49242
[[6]]
1 2 3 5 7 8 9 10
95.01579 89.02247 91.78780 93.40771 64.53681 75.66373 97.08244 34.92850
[[7]]
1 2 6 9 10
40.01250 51.10773 64.53681 60.41523 57.55867
[[8]]
3 6
94.86833 75.66373
[[9]]
1 2 6 7 10
30.870698 9.486833 97.082439 60.415230 67.119297
[[10]]
1 2 4 5 6 7 9
73.75636 60.53924 99.32774 82.49242 34.92850 57.55867 67.11930