Finding unique vector elements in a list efficiently - R

I have a list of numeric vectors, and I need to create a list containing only one copy of each distinct vector. Since there isn't a list method for the identical function, I wrote a function that checks every vector against every other.
F1 <- function(x) {
  to_remove <- c()
  for (i in seq_along(x)) {
    for (j in seq_along(x)) {
      # flag only the later copy, so the first occurrence of each vector is kept
      if (j > i && identical(x[[i]], x[[j]])) to_remove <- c(to_remove, j)
    }
  }
  if (is.null(to_remove)) x else x[-to_remove]
}
The problem is that this function becomes very slow as the input list x grows, partly because of the quadratic number of pairwise comparisons and partly because to_remove is grown by repeated c() calls inside the loops. I'm hoping for a method that will run in under one minute for a list of length 1.5 million with vectors of length 15, but that may be optimistic.
Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be equal in length.
Sample output is shown below.
x <- list(1:4, 1:4, 2:5, 3:6)
F1(x)
# list(1:4, 2:5, 3:6)

As per @JoshuaUlrich and @thelatemail, ll[!duplicated(ll)] works just fine, and thus so should unique(ll).
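Applied to the question's sample input:
x <- list(1:4, 1:4, 2:5, 3:6)
unique(x)
# [[1]]
# [1] 1 2 3 4
#
# [[2]]
# [1] 2 3 4 5
#
# [[3]]
# [1] 3 4 5 6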
I previously suggested a method using sapply with the idea of not checking every element in the list (I deleted that answer, as I think using unique makes more sense).
Since efficiency is a goal, we should benchmark these.
# Let's create some sample data
xx <- lapply(rep(100, 15), sample)     # 15 random permutations of 1:100
ll <- as.list(sample(xx, 1000, TRUE))  # 1000 vectors drawn from those 15
ll
Putting it up against some benchmarks:
library(digest)  # needed for fun2

fun1 <- function(ll) {
  # compare each element against all earlier ones
  ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}
fun2 <- function(ll) {
  # hash each vector, then deduplicate the hashes
  ll[!duplicated(sapply(ll, digest))]
}
fun3 <- function(ll) {
  ll[!duplicated(ll)]
}
fun4 <- function(ll) {
  unique(ll)
}
# Make sure all four give the same result
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)),
identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE
library(rbenchmark)
benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]
#         test elapsed relative user.self sys.self
# 3     unique   0.048    1.000     0.049    0.000
# 2 duplicated   0.050    1.042     0.050    0.000
# 1     digest   8.427  175.563     8.415    0.038
# I took out fun1, since it ran extremely slowly when ll was large
Fastest Option:
unique(ll)
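For the scale mentioned in the question (1.5 million vectors of length 15), here is a quick sketch to test unique() yourself; whether it meets the one-minute target will depend on your machine:
# build 1.5 million length-15 vectors, drawn from a pool of 1000 distinct ones
pool <- replicate(1000, sample(100, 15), simplify = FALSE)
big  <- pool[sample(1000, 1.5e6, replace = TRUE)]
system.time(u <- unique(big))
length(u)  # should be at most 1000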

You could hash each of the vectors and then use !duplicated() to identify unique elements of the resultant character vector:
library(digest)
## Some example data
x <- 1:44
y <- 2:10
z <- rnorm(10)
ll <- list(x,y,x,x,x,z,y)
ll[!duplicated(sapply(ll, digest))]
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373 0.94088670
# [7] -0.20254574 -1.08275938 -0.32937153 0.49454570
To see at a glance why this works, here's what the hashes look like:
sapply(ll, digest)
# [1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
# [3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
# [5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
# [7] "fd61b0fff79f76586ad840c9c0f497d1"

Related

Why is the for loop returning NA vectors in some positions (in R)?

Following a YouTube tutorial, I have created a vector x = c(-3, 6, 2, 5, 9).
Then I create an empty variable of length 5 with numeric(5).
I want to store the squares of my vector x in Storage2 with a for loop.
When I do the for loop and update my variable, it returns a very strange thing:
[1] 9 4 0 9 25 36 NA NA 81
I can see all the numbers in x have been squared, but the order seems random, and there are more than 5 values.
Also, why are there NAs? If it's because the last number of x is 9 (and so that number defines the length?), and there are no values for positions 7 and 8, I would understand, but then I'm also missing positions 1, 3 and 4, so there should be more NAs...
I'm just starting with R, so please keep it simple, and correct me if I'm wrong in my thought process! Thank you!
x <- c(-3,6,2,5,9)
Storage2 <- numeric(5)
for(i in x){
  Storage2[i] <- i^2
}
Storage2
# [1] 9 4 0 9 25 36 NA NA 81
You're looping over the elements of x, not over the positions as probably intended: each value of x is used both as the number to square and as the index to assign to (i = -3 is even a negative index, so Storage2[-3] <- 9 assigns 9 to every position except 3). You need to change your loop like so:
for(i in 1:length(x)) {
  Storage2[i] <- x[i]^2
}
Storage2
# [1] 9 36 4 25 81
(Note: 1:length(x) can also be written as seq_along(x), as pointed out by @NelsonGon in the comments, and may also be faster.)
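To see exactly how the original loop produced [1] 9 4 0 9 25 36 NA NA 81, trace each iteration: i itself is used as the index, a negative index assigns to every position except that one, and an index beyond the current length grows the vector, filling the gap with NA:
Storage2 <- numeric(5)  # 0 0 0 0 0
Storage2[-3] <- 9       # 9 9 0 9 9   (every position except 3)
Storage2[6] <- 36       # 9 9 0 9 9 36   (vector grows to length 6)
Storage2[2] <- 4        # 9 4 0 9 9 36
Storage2[5] <- 25       # 9 4 0 9 25 36
Storage2[9] <- 81       # 9 4 0 9 25 36 NA NA 81   (positions 7 and 8 become NA)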
However, R is a vectorized language, so you can simply do this:
Storage2 <- x^2
Storage2
# [1] 9 36 4 25 81

Unable to fill a data frame using column names and row names

I have a data frame "x" like this:
meme webId timeStamp
2501 68814 281322.1
2501 2679 305813.0
2501 948 306025.6
I want to use "meme" and "webId" as the row and column names, with timeStamp as the cell values, in a new data frame "mat". I wrote this:
cols<-unique(x[,"webId"])
rows<-unique(x[,"meme"])
mat<-data.frame(matrix(data=9999999,nrow=length(rows),ncol=length(cols)))
colnames(mat)<-c(cols)
rownames(mat)<-c(rows)
for(i in 1:length(x))
mat[rownames(mat)==x[i,"meme"],colnames(mat)==x[i,"webId"]]<-x[i,"timeStamp"]
But nothing changed. What is the problem? Please help me!
In the for loop, it seems that you mean to iterate over all the rows in x and fill the values into mat one by one. Instead you only iterate 3 times: length(x) gives the number of columns of a data frame, not the number of rows. This is the correct code for iterating over all rows:
for(i in 1:nrow(x))
mat[rownames(mat)==x[i,"meme"],colnames(mat)==x[i,"webId"]]<-x[i,"timeStamp"]
I suspect that your real x dataframe contains more rows than the ones you posted. In your example the number of rows happens to equal the number of columns (both are 3), which is why the commenters couldn't spot a problem with it: the bug is simply not visible in your example.
You could get the row/column indices with match, cbind them into a two-column index matrix, and assign the 'timeStamp' values to the positions of 'mat' specified by that index.
mat[cbind(match(x$meme, rownames(mat)),
match(x$webId, colnames(mat)))] <- x$timeStamp
mat
# 428 2679 68814 948
#2505 13 11 8 3
#2510 16 6 14 1
#2501 7 4 5 10
#2508 12 2 9 15
Checking with the results from the for loop
for(i in 1:nrow(x))
mat1[rownames(mat1)==x[i,"meme"],
colnames(mat1)==x[i,"webId"]]<-x[i,"timeStamp"]
mat1
# 428 2679 68814 948
#2505 13 11 8 3
#2510 16 6 14 1
#2501 7 4 5 10
#2508 12 2 9 15
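The key idiom above is matrix indexing: when a matrix is indexed by a two-column matrix, each row of the index picks out one (row, column) cell, so the whole assignment happens in one vectorized step. A minimal illustration:
m <- matrix(0, 2, 2)
m[cbind(c(1, 2), c(2, 1))] <- c(10, 20)  # fill cells (1,2) and (2,1)
m
#      [,1] [,2]
# [1,]    0   10
# [2,]   20    0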
Benchmarks
set.seed(21)
x1 <- data.frame(meme = rep(sample(1000), each=200),
                 webId = rep(sample(35000, 200, replace=FALSE), 1000),
                 timeStamp = rnorm(1000*200))
set.seed(324)
mat2 <- matrix(NA, 1000, 200,
               dimnames=list(sample(unique(x1$meme)), sample(unique(x1$webId))))
mat3 <- mat2
system.time({
mat2[cbind(match(x1$meme, rownames(mat2)),
match(x1$webId, colnames(mat2)))] <- x1$timeStamp
})
# user system elapsed
# 0.181 0.001 0.181
system.time({
for(i in 1:nrow(x1))
mat3[rownames(mat3)==x1[i,"meme"],
colnames(mat3)==x1[i,"webId"]]<-x1[i,"timeStamp"]
})
# user system elapsed
#172.588 10.445 183.062
identical(mat2, mat3)
#[1] TRUE
data
set.seed(24)
x <- data.frame(meme=rep(c(2501, 2505, 2508, 2510), each=4),
webId= rep(c(68814, 2679, 948, 428), 4), timeStamp= sample(16))
set.seed(33)
mat <- matrix(NA, 4, 4, dimnames=list(sample(unique(x$meme)),
                                      sample(unique(x$webId))))
mat1 <- mat

Calculate the mean of each n rows in a data frame in R when the starting row varies

First make some example data:
df = data.frame(matrix(rnorm(200), nrow=100))
df1=data.frame(t(c(25,34)))
The starting row is different in each column: in X1 I would like to start from the 25th row, while in X2 from row 34. Then I want to calculate the mean of every 5 values over the next 50 rows, for all the columns in df.
I am new to R, so this is probably very obvious. Can anyone provide some suggestions on how I can do this?
You could try Map.
lst <- Map(function(x, y) {
         x1 <- x[y:length(x)]
         tapply(x1, as.numeric(gl(length(x1), 5, length(x1))), FUN = mean)
       }, df, df1)
lst
# $X1
# 1 2 3 4 5 6
#-0.16500158 0.11339623 -0.86961872 -0.54985564 0.19958461 0.35234983
# 7 8 9 10 11 12
#0.32792769 0.65989801 -0.30409184 -0.53264725 -0.45792792 -0.59139844
# 13 14 15 16
# 0.03934133 -0.38068187 0.10100007 1.21017392
#$X2
# 1 2 3 4 5 6
# 0.24525622 0.07367300 0.18733973 -0.43784202 -0.45756095 -0.45740178
# 7 8 9 10 11 12
#-0.54086152 0.10439072 0.65660937 0.70623380 -0.51640088 0.46506135
# 13 14
#-0.09428336 -0.86295101
Because of the length difference, it might be better to keep it as a list. But, if you need it in a matrix/data.frame, you can make the lengths equal by padding with NAs.
do.call(cbind, lapply(lst, `length<-`, max(sapply(lst, length))))
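As an aside, the gl() call is what builds the grouping: gl(n, k, length) creates a factor with n levels, each repeated k times, truncated to length, so every run of 5 consecutive values gets the same group id and tapply() averages within each group. For example:
as.numeric(gl(12, 5, 12))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3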
Update
If you need only the next 50 rows, change y:length(x) to y:(y + 49) in the Map code, as spelled out below.
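Spelled out (following the same pattern as the Map code above), the 50-row variant would be:
lst50 <- Map(function(x, y) {
           x1 <- x[y:(y + 49)]  # only the next 50 values
           tapply(x1, as.numeric(gl(length(x1), 5, length(x1))), FUN = mean)
         }, df, df1)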
data
set.seed(24)
df <- data.frame(matrix(rnorm(200), nrow=100))
df1 <- data.frame(t(c(25,34)))
Your question isn't entirely clear to me (especially the second line of your code), but I think this might be close to what you want to do:
every_fifth_row <- df[seq(1, nrow(df), 5), ]
every_fifth_row
# X1 X2
# 1 -0.09490455 -0.28417104
# 6 -0.14949662 0.12857284
# 11 0.15297366 -0.84428186
# 16 -1.03397309 0.04775516
# 21 -1.95735213 -1.03750794
# 26 1.61135194 1.10189370
# 31 0.12447365 1.80792719
# 36 -0.92344017 0.66639710
# 41 -0.88764143 0.10858376
# 46 0.27761464 0.98382526
# 51 -0.14503359 -0.66868956
# 56 -1.70208187 0.05993688
# 61 0.33828525 1.00208639
# 66 -0.41427863 1.07969341
# 71 0.35027994 -1.46920059
# 76 1.38943839 0.01844205
# 81 -0.81560917 -0.32133221
# 86 1.38188423 -0.77755471
# 91 1.53247872 -0.98660308
# 96 0.45721909 -0.22855622
rowMeans(every_fifth_row)
colMeans(every_fifth_row)
# Alternative
# apply(every_fifth_row, 1, mean) # Row-wise mean
# apply(every_fifth_row, 2, mean) # Column-wise mean
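If what you actually want is the mean of each consecutive block of 5 rows (what the Map answer above computes) rather than every fifth row, a base-R sketch using the same gl() grouping idea:
# mean of each consecutive block of 5 rows, per column
# (assumes nrow(df) is a multiple of 5)
sapply(df, function(col) tapply(col, gl(nrow(df) / 5, 5), mean))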

Getting from histogram counts to a CDF

I have a dataframe of values where, for each value, I have the count associated with it. So, plotting counts against values gives me the histogram. I have three types: a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts for all values up to and including a given value:
get_cumulative <- function(x) {
  result <- numeric(nrow(x))
  for (i in seq_along(result)) {
    result[i] <- sum(x[x$value <= x$value[i], ]$counts)
  }
  x$cumulative <- result
  x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
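For the poster's data (which has a value column), if it isn't already sorted by value within each type, order it first so the cumulative sums accumulate in the right direction:
df <- df[order(df$type, df$value), ]   # sort by value within each type
transform(df, cdf = ave(counts, type, FUN = function(x) cumsum(x) / sum(x)))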
If your data is in the data frame DF, the following should do (note that cumsum needs numeric input, so split only the counts column, not the whole data frame):
do.call(rbind, lapply(split(DF$counts, DF$type), cumsum))
The HistogramTools package on CRAN has several functions for converting between histograms and CDFs, for calculating information loss or error margins, and for plotting, all of which can help here.
If you have a histogram h, then calculating the empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R histogram object, see the PreBinnedHistogram function.
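For example, a sketch for one type of the posted data; the exact argument names here are an assumption on my part, so check ?PreBinnedHistogram for the real interface:
library(HistogramTools)
a <- df[df$type == "a", ]
# breaks must be one longer than counts; assuming unit-width bins at each value
h <- PreBinnedHistogram(breaks = c(a$value, max(a$value) + 1),
                        counts = a$counts)
plot(HistToEcdf(h))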

How to find the difference between 2 dataframes?

I have 2 dataframes which are "exactly" the same. The difference between them is that one has 676 observations (rows) and the second has 666. I don't know which rows are missing from the second dataframe.
It would be easiest for me if someone could show me the code to make a third dataframe containing those 10 missing rows.
The names of the dataframes:
- dataset1 (676 rows)
- dataset2 (666 rows)
Thanks.
This stacks dataset2 on top of dataset1; duplicated() then flags every row of dataset1 that already appeared in dataset2, and tail(..., nrow(dataset1)) keeps just the flags belonging to dataset1's rows:
dataset1[tail(!duplicated(rbind(dataset2, dataset1)), nrow(dataset1)), ]
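A quick demonstration on toy data (rows 2 and 4 dropped from the second frame):
df1 <- data.frame(a = 1:5, b = letters[1:5])
df2 <- df1[-c(2, 4), ]
df1[tail(!duplicated(rbind(df2, df1)), nrow(df1)), ]
#   a b
# 2 2 b
# 4 4 d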
Here's an approach:
library(qdap)
## generate random problem
prob <- sample(1:nrow(mtcars), 1)
## remove the random problem row
mtcars2 <- mtcars[-prob, ]
## Throw it into a list of 2 dataframes so they're easier to work with
dat <- list(mtcars, mtcars2)
## Use qdap's `paste2` function to paste all columns together
dat2 <- lapply(dat, paste2)
## Find the shorter data set
wmn <- which.min(sapply(dat2, length))
## Add additional element to shorter one
dat2[[wmn]] <- c(dat2[[wmn]], NA)
## check each element of the 2 pasted data sets for equality
out <- mapply(identical, dat2[[1]], dat2[[2]])
## Which row's the problem
which(!out)[1]
which(!out)[1] == prob
If which(!out)[1] is NA, the problem is in the last row. Otherwise, the position where out first turns FALSE is where the problem row is located.
EDIT: removed the for loop
I would say try to use merge and then look for where the merge result has NA values.
Here's an example using dummy data:
set.seed(1)
df1 <- data.frame(x=rnorm(100),y=rnorm(100))
df2 <- df1[-sample(1:100,10),]
dim(df1)
# [1] 100 2
dim(df2)
# [1] 90 2
out <- merge(df1,df2,by='x',all.x=TRUE)
in1not2 <- which(is.na(out$y.y))
in1not2
# [1] 6 25 33 51 52 53 57 73 77 82
Then you can extract:
> df1[in1not2,]
x y
6 -0.8204684 1.76728727
25 0.6198257 -0.10019074
33 0.3876716 0.53149619
51 0.3981059 0.45018710
52 -0.6120264 -0.01855983
53 0.3411197 -0.31806837
57 -0.3672215 1.00002880
73 0.6107264 0.45699881
77 -0.4432919 0.78763961
82 -0.1351786 0.98389557
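An alternative that compares whole rows rather than a single key column is to paste each row into a string and take the set difference, a base-R version of the paste2 idea above:
key1 <- do.call(paste, df1)  # one string per row
key2 <- do.call(paste, df2)
df1[!key1 %in% key2, ]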
