Is there an elegant way to take a vector like
x = c("a", "b", "c", "d", "e")
and turn it into
x = c("b", "c", "d", "e", "a")
I did:
x = c("a", "b", "c", "d", "e")
firstVal = x[1]
x = tail(x,-1)
x[length(x)+1] = firstVal
x
[1] "b" "c" "d" "e" "a"
It works, but it's kind of ugly.
Elegance is a matter of taste, and de gustibus non est disputandum:
> x <- c("a", "b", "c", "d", "e")
> c(x[-1], x[1])
[1] "b" "c" "d" "e" "a"
Does the above make you happy? :)
Overkill time: You can consider my shifter or my moveMe function, both of which are part of my GitHub-only SOfun package.
Here are the relevant examples:
shifter
This is basically a head and tail approach:
## Specify how many values need to be shifted
shifter(x, 1)
# [1] "b" "c" "d" "e" "a"
shifter(x, 2)
# [1] "c" "d" "e" "a" "b"
## The direction can be changed too :-)
shifter(x, -1)
# [1] "e" "a" "b" "c" "d"
moveMe
This is fun:
moveMe(x, "a last")
# [1] "b" "c" "d" "e" "a"
## Lots of options to move things around :-)
moveMe(x, "a after d; c first")
# [1] "c" "b" "d" "a" "e"
Agreed with the matter of taste comment. My personal approach would be:
x[c(2:length(x), 1)]
I had an idea to use both head and tail, but it turned out to be a flop when I benchmarked it.
c(tail(x, -1), head(x, 1))
I figured I'd share the results as they're informative. I also scaled up for larger vectors and the results were interesting:
x <- c("a", "b", "c", "d", "e")
gagolews <- function() c(x[-1], x[1])
senoro <- function() x[c(2:length(x), 1)]
tyler <- function() c(tail(x, -1), head(x, 1))
ananda <- function() shifter(x, 1)
user <- function() {
  firstVal = x[1]
  x = tail(x, -1)
  x[length(x) + 1] = firstVal
  x
}
library(microbenchmark)
(op <- microbenchmark(
  gagolews(),
  senoro(),
  tyler(),
  ananda(),
  user(),
  times = 1000L))
## Unit: microseconds
## expr min lq median uq max neval
## gagolews() 1.400 1.867 2.333 2.799 5.132 1000
## senoro() 1.400 1.867 2.333 2.334 10.730 1000
## tyler() 37.320 39.187 40.120 41.519 135.287 1000
## ananda() 39.653 41.519 42.452 43.386 69.043 1000
## user() 24.259 25.658 26.591 27.058 1757.789 1000
Here I scaled up. I only benchmarked 100 replications because of the size of the vector (5 million elements).
x <- rep(c("a", "b", "c", "d", "e"), 1000000)
## Unit: milliseconds
## expr min lq median uq max neval
## gagolews() 168.9151 171.3631 179.0490 193.9604 260.5963 100
## senoro() 192.2669 203.9596 259.1366 272.5570 341.4443 100
## tyler() 237.4218 246.5368 303.5700 319.3999 347.3610 100
## ananda() 237.9610 247.2097 303.9898 318.4564 342.2518 100
## user() 225.4503 234.3431 287.8348 300.8078 319.2051 100
The one thing that could potentially be improved on in the previous answers is that you have to know the position of the item you want to move, which can become an issue with longer vectors. You could instead do something like:
x <- c("a", "b", "c", "d", "e")
new_x <- c(x[-which(x == "a")], "a")
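One caveat (my own note, not from the answer): which(x == "a") removes every occurrence of the value, so duplicates collapse to a single trailing "a". A small sketch of a hypothetical helper that moves only the first match:
move_last <- function(x, value) {
  i <- match(value, x)       # position of the first match only
  if (is.na(i)) return(x)    # value absent: leave the vector unchanged
  c(x[-i], x[i])
}
move_last(c("a", "b", "a", "c"), "a")
# [1] "b" "a" "c" "a"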
Related
Imagine I have a vector like this one:
c("A", "B", "C", "D")
There are 4 positions. If I take a sample of size 1, I can get 1, 2, 3 or 4. What I want is a subset of length 3 of that vector, following its order. For example, if I get 2:
c("B", "C", "D")
If I get 3:
c("C", "D", "A")
If I get 4:
c("D","A", "B")
So that's the logic: the vector is ordered, and the last element connects back to the first when I subset.
Using seq, f returns the desired subset of a vector v: l elements, starting at the nth position and wrapping around the end.
v <- c("A", "B", "C", "D")
# seq(n - 1, length.out = l) gives l consecutive 0-based positions starting at n - 1;
# %% length(v) wraps them around the end, and + 1 converts back to 1-based indexing
f <- function(v, n, l) v[seq(n - 1, length.out = l) %% length(v) + 1]
output
f(v, n = 4, l = 3)
#[1] "D" "A" "B"
f(v, n = 3, l = 4)
#[1] "C" "D" "A" "B"
f(v, n = 2, l = 5)
#[1] "B" "C" "D" "A" "B"
I think I got it!
v <- c("A", "B", "C", "D")
p <- sample(1:length(v), 1)
r <- c(v[p:length(v)])
c(r, v[!(v %in% r)])[1:3]
And the outputs:
v <- c("A", "B", "C", "D") # your vector
r <- c(v[2:length(v)])
c(r, v[!(v %in% r)])[1:3]
#> [1] "B" "C" "D"
r <- c(v[3:length(v)])
c(r, v[!(v %in% r)])[1:3]
#> [1] "C" "D" "A"
r <- c(v[4:length(v)])
c(r, v[!(v %in% r)])[1:3]
#> [1] "D" "A" "B"
Created on 2022-05-16 by the reprex package (v2.0.1)
Wrapped in a function:
f <- function(v, nth) {
  r <- c(v[nth:length(v)])
  return(c(r, v[!(v %in% r)])[1:3])
}
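Called with the same v as above, it reproduces the outputs (note that it assumes distinct values and a fixed subset length of 3):
f(v, 2)
# [1] "B" "C" "D"
f(v, 4)
# [1] "D" "A" "B"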
I need to generate a random sample from a multivariate normal distribution, using set.seed(12346), with 100 columns and 5000 rows.
So far I have got this:
set.seed(12346)
Preg1 <- data.frame(MASS::mvrnorm(n=5000,mu=c(0,0,0),Sigma = diag(3)))
The above gives me three columns; how can I get 100?
I cannot figure out how to build the mu vector of 100 zeros without typing them all in; the Sigma would then be Sigma = diag(100).
You can use mu = rep(0, 100). The rep function is used to repeat values.
set.seed(12346)
ncol <- 100
Preg1 <- data.frame(MASS::mvrnorm(n = 5000, mu = rep(0, ncol), Sigma = diag(ncol)))
dim(Preg1)
# [1] 5000 100
The rep function is quite useful; it can be used in several other ways that aren't needed here but are good to know about:
rep(c("A", "B", "C"), times = 3)
# [1] "A" "B" "C" "A" "B" "C" "A" "B" "C"
rep(c("A", "B", "C"), times = 1:3)
# [1] "A" "B" "B" "C" "C" "C"
rep(c("A", "B", "C"), each = 3)
# [1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
In this particular case, because your Sigma is an identity matrix, each column is actually independent. So it would be equivalent to generate each column (or even each draw) independently, which we could do in either of these ways:
x = replicate(n = ncol, rnorm(5000))
dim(x)
# [1] 5000 100
z = matrix(rnorm(5000 * ncol), ncol = ncol)
dim(z)
# [1] 5000 100
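As a quick sanity check of that equivalence (my own illustration, not part of the answer), the sample covariance of each version is close to the identity matrix:
round(cov(Preg1)[1:3, 1:3], 2)  # roughly diag(1), up to sampling noise
round(cov(x)[1:3, 1:3], 2)
round(cov(z)[1:3, 1:3], 2)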
I have a data.table such as:
example <- data.table(fir =c("A", "B", "C", "A","A", "B", "C"), las=c( "B", "C","B", "C", "B", "C","C"))
   fir las
1:   A   B
2:   B   C
3:   C   B
4:   A   C
5:   A   B
6:   B   C
7:   C   C
Though I guess the problem is the same with a data.frame.
and I would like to get a vector like this:
A, B, B, C, C, B, A, C, A, B, B, C, C, C
That is, I want to flatten the table row by row, reading each row left to right...
I've tried unlist(example) but it extracts the data columnwise instead.
How can I get it?
I've also tried with apply, transposing and other strange things.
Since data in a matrix, as well as in a data.frame/data.table (though stored differently from a matrix), is arranged column-wise, you can transpose it first:
as.vector(t(example))
# [1] "A" "B" "B" "C" "C" "B" "A" "C" "A" "B" "B" "C" "C" "C"
A benchmark including the options provided by @Sotos, @Frank and @Wen, using a dummy data set:
example <- as.data.table(matrix(sample(LETTERS, 10^7, replace = T), ncol = 1000))
dim(example)
#[1] 10000 1000
library(microbenchmark)
psidom <- function() as.vector(t(example))
sotos <- function() c(t(example))
frank <- function() unlist(transpose(example), use.names = FALSE)
wen <- function() unname(unlist(data.frame(t(example))))
# data.table 1.10.4
microbenchmark(psidom(), sotos(), frank(), wen(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# psidom() 163.5993 178.9236 393.4838 198.6753 632.1086 1352.012 10
# sotos() 186.8764 188.3734 467.2117 343.1514 618.3121 1221.721 10
# frank() 3065.0988 3493.3691 5315.4451 4649.4643 5742.2399 9560.642 10
# wen() 7316.6743 8497.1409 9200.4397 9038.2834 9631.5313 11931.075 10
Another test in data.table dev version 1.10.5:
# data.table 1.10.5
psidom <- function() as.vector(t(example))
sotos <- function() c(t(example))
frank <- function() unlist(transpose(example), use.names = FALSE)
fast <- function() `attributes<-`(t(example), NULL)
microbenchmark(psidom(), sotos(), frank(), fast(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# psidom() 228.1248 246.4666 271.6772 256.9131 287.5072 354.2053 10
# sotos() 254.3512 280.2504 315.3487 322.5726 344.7125 390.3482 10
# frank() 290.5476 310.7076 374.6267 349.8021 431.8451 491.9301 10
# fast() 159.6006 167.6316 209.8363 196.8821 272.4758 281.3146 10
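For clarity (my own illustration, not part of the benchmark): `attributes<-`(m, NULL) simply strips the dim attribute from the transposed matrix, leaving the underlying data, already in row-wise order after t(), as a plain vector:
m <- matrix(1:6, nrow = 2, byrow = TRUE)
`attributes<-`(t(m), NULL)  # transpose, then drop dim: row-wise flattening
# [1] 1 2 3 4 5 6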
I have written this loop to extract the names of the elements of a vector that occur within each time interval (bin). I am wondering if I am missing a faster way to do this. I want to apply a randomization procedure to vectors that are thousands of elements long, so I would rather not rely on a loop.
mydata <- structure(c(1199.91666666667, 1200.5, 1204.63333333333, 1205.5,
1206.3, 1208.73333333333, 1209.06666666667, 1209.93333333333,
1210.98333333333, 1214.56666666667, 1216.06666666667, 1216.63333333333,
1216.91666666667, 1219.13333333333, 1221.35, 1221.51666666667,
1225.35, 1225.53333333333, 1225.96666666667, 1227.61666666667,
1228.91666666667, 1230.31666666667, 1233.53333333333, 1235.8,
1237.51666666667, 1239.41666666667, 1241.6, 1247.08333333333,
1247.45, 1252.7, 1253.26666666667), .Names = c("B", "A", "B",
"E", "A", "A", "B", "G", "G", "C", "A", "D", "E", "B", "B", "E",
"E", "G", "F", "A", "C", "A", "F", "B", "A", "F", "F", "G", "F",
"G", "F"))
mydata
B A B E A A B G G C A D E B B E E
1199.917 1200.500 1204.633 1205.500 1206.300 1208.733 1209.067 1209.933 1210.983 1214.567 1216.067 1216.633 1216.917 1219.133 1221.350 1221.517 1225.350
G F A C A F B A F F G F G F
1225.533 1225.967 1227.617 1228.917 1230.317 1233.533 1235.800 1237.517 1239.417 1241.600 1247.083 1247.450 1252.700 1253.267
These represent consecutive event times in seconds. Say we want to make our intervals 5 s long. My approach is to make a vector of the start of each interval and then use a loop to find the names of the elements occurring within each interval:
N <- 5
ints <- seq(mydata[1], mydata[length(mydata)], N)
out <- list()
for (i in 1:length(ints)) {
  out[[i]] <- names(mydata[mydata >= ints[i] & mydata < ints[i] + N])
}
out
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
This is fine for small samples, but I can see this getting slow for very large samples that are permuted thousands of times.
My suggestion is to use findInterval (based on an answer to this earlier question of mine):
mydata2 <- c(-Inf, mydata)                  # prepend a sentinel smaller than every observation
ints <- seq(mydata[1], mydata[length(mydata)] + 5, N)
idx <- findInterval(ints - 1e-10, mydata2)  # index of the last observation before each breakpoint
out <- list()
for (i in 1:(length(ints) - 1)) {
  out[[i]] <- names(mydata2[(idx[i] + 1):(idx[i + 1])])
}
As you can see, I had to do a little tinkering at the beginning (prepending a value smaller than the first breakpoint and subtracting a small epsilon). Here's the result; it is identical to yours:
> out
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
In terms of speed for the example there is some improvement:
> microbenchmark( jalapic = {out<-list(); for(i in 1:length(ints)){out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])}},
+ mts = {idx = findInterval(ints2-1e-10, mydata2); out<-list(); for(i in 1:(length(ints)-1)){out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])}},
+ alexis = {split(names(mydata), findInterval(mydata, ints))},
+ R_Yoda = {dt[, groups := cut2(data,ints)]; result <- dt[, paste0(names, collapse=", "), by=groups]})
Unit: microseconds
expr min lq mean median uq max neval
jalapic 67.177 76.9725 85.73347 82.8035 95.866 119.890 100
mts 43.851 52.7150 62.72116 58.3130 73.007 96.099 100
alexis 75.573 86.5360 95.72593 91.4340 100.531 234.649 100
R_Yoda 2032.066 2158.4870 2303.68887 2191.3750 2281.409 8719.314 100
For larger vectors (I chose length 2000) this is clearer:
set.seed(123)
mydata = sort(runif(n = 2000, min = 0, max = 100))
names(mydata) = sample(LETTERS[1:7], size = 2000, replace = T)
mydata2 = c(-Inf, mydata)
ints2 <- seq(mydata[1], mydata[length(mydata)]+5, N)
dt <- data.table(data=mydata, names=names(mydata) )
> microbenchmark( jalapic = {out<-list(); for(i in 1:length(ints)){out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])}},
+ mts = {idx = findInterval(ints2-1e-10, mydata2); out<-list(); for(i in 1:(length(ints)-1)){out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])}},
+ alexis = {split(names(mydata), findInterval(mydata, ints))},
+ R_Yoda = {dt[, groups := cut2(data,ints)]; result <- dt[, paste0(names, collapse=", "), by=groups]})
Unit: microseconds
expr min lq mean median uq max neval
jalapic 804.243 846.9275 993.9957 862.0890 883.3140 7140.218 100
mts 77.439 88.8685 100.6148 100.0640 106.5955 188.466 100
alexis 187.066 204.7930 220.1689 215.5225 225.3190 299.026 100
R_Yoda 3831.348 4066.4640 4366.5382 4140.1700 4248.8635 11829.923 100
For performance reasons I am using data.table:
Edit: This solution works, but is NOT very fast (as shown by mts's answer).
library(Hmisc)
library(data.table)
# assuming that your mydata vector from the question is loaded
N=5 # code from your question...
ints <- seq(mydata[1], mydata[length(mydata)], N) # code from your question...
dt <- data.table(data=mydata, names=names(mydata) )
dt[, groups := cut2(data,ints)] # attention: shall the interval ends be included in the group or not?
groups <- dt[ , .(result=list(names)), by=groups] # a data.table column can itself be a list!
# to get the result as list:
out <- groups[,result]
out
Edit: You could replace cut2 by findInterval and do it all in one line, but it is still slower:
out <- dt[, .(result=list(names)), by = findInterval(data,ints) ]
This is the result:
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
Given a list:
foo <- list(c("a", "b", "d"), c("c", "b"), c("c"),
            c("b", "d"), c("e", "f"), c("e", "g"))
what is an efficient way to get a list that contains the disjoint sets of its content?
Here I want to obtain:
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
The solutions I have managed to come up with seem overly complicated and slow (I'm working with a largish list of 4000+ vectors, each containing up to hundreds of elements).
Thanks!
Benchmarking the solutions
Thank you all for your input. The igraph approach is really nice. I benchmarked the proposed solutions, and using igraph with @flodel's suggestion is efficient. The example here (iGrp) has 3170 elements.
> microbenchmark(igraph_method(iGrp), igraph_method2(iGrp), iterative_method(iGrp), times=10L)
## Unit: milliseconds
## expr min lq median uq max neval
## igraph_method(iGrp) 6892.8534 7140.0287 7229.5569 7396.2458 8044.9796 10
## igraph_method2(iGrp) 381.4555 391.2097 442.3282 472.5641 537.4885 10
## iterative_method(iGrp) 7118.7857 7272.9568 7595.9700 7675.2888 8485.4388 10
#### functions used
igraph_method <- function(lst) {
  edg <- do.call("rbind", lapply(lst, function(x) {
    if (length(x) > 1) t(combn(x, 2)) else NULL
  }))
  g <- graph.data.frame(edg)
  split(V(g)$name, clusters(g)$membership)
}

igraph_method2 <- function(lst) {
  edg <- do.call("rbind", lapply(lst, function(x) {
    if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
  }))
  g <- graph.data.frame(edg)
  split(V(g)$name, clusters(g)$membership)
}

iterative_method <- function(lst) {
  Reduce(function(l, x) {
    matches <- sapply(l, function(i) any(x %in% i))
    if (any(matches)) {
      combined <- unique(c(unlist(l[matches]), x))
      l[matches] <- NULL            # Delete old entries
      l <- c(l, list(combined))     # Add combined entry
    } else {
      l <- c(l, list(x))            # New list entry
    }
    l
  }, lst, init = list())
}
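For context (my own illustration, not part of the original post): the speed-up in igraph_method2 comes from building far fewer edges per set. Chaining consecutive elements yields the same connected components as enumerating all pairs:
x <- letters[1:5]
nrow(t(combn(x, 2)))                   # 10 edges: every pair (igraph_method)
nrow(cbind(head(x, -1), tail(x, -1)))  # 4 edges: a chain, same connected component (igraph_method2)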
One way to approach this sort of problem is to build a graph whose nodes are the values in your list, with an edge between values that have appeared together. Then you're just asking for the connected components of that graph. The igraph package in R makes this pretty easy. First, you'll want to build a two-column table of edges:
edges <- do.call(rbind, lapply(foo, function(x) {
  if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
}))
edges
# [,1] [,2]
# [1,] "a" "b"
# [2,] "b" "d"
# [3,] "c" "b"
# [4,] "b" "d"
# [5,] "e" "f"
# [6,] "e" "g"
Then, you can build your graph from the edges and compute the connected components:
library(igraph)
g <- graph.data.frame(edges, directed=FALSE)
split(V(g)$name, clusters(g)$membership)
# $`1`
# [1] "a" "b" "c" "d"
#
# $`2`
# [1] "e" "f" "g"
For reasonably large problems, this approach seems to be modestly faster than an iterative approach:
values = as.character(1:2000)
set.seed(144)
foo <- lapply(1:4000, function(x) sample(values, rbinom(1, 10, .5)))
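josilber() and lundberg() aren't defined in the snippet; presumably they wrap the igraph approach above and the iterative Reduce approach from the answer below. A rough reconstruction (my assumption, not the original code) so the benchmark can be reproduced:
# Assumed wrappers, reconstructed from the two answers in this thread
josilber <- function(lst) {
  edges <- do.call(rbind, lapply(lst, function(x) {
    if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
  }))
  g <- graph.data.frame(edges, directed = FALSE)
  split(V(g)$name, clusters(g)$membership)
}
lundberg <- function(lst) {
  Reduce(function(l, x) {
    matches <- sapply(l, function(i) any(x %in% i))
    if (any(matches)) c(l[!matches], list(unique(c(unlist(l[matches]), x))))
    else c(l, list(x))
  }, lst, init = list())
}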
library(microbenchmark)
microbenchmark(josilber(foo), lundberg(foo))
# Unit: milliseconds
# expr min lq median uq max neval
# josilber(foo) 251.8007 281.0168 297.2446 314.6714 635.7916 100
# lundberg(foo) 640.0575 714.9658 761.3777 827.5415 1118.3517 100
Here is an iterative approach, building a list for the result, and combining elements as they are seen together:
Reduce(function(l, x) {
  matches <- sapply(l, function(i) any(x %in% i))
  if (any(matches)) {
    combined <- unique(c(unlist(l[matches]), x))
    l[matches] <- NULL            # Delete old entries
    l <- c(l, list(combined))     # Add combined entry
  } else {
    l <- c(l, list(x))            # New list entry
  }
  l
}, foo, init = list())
## [[1]]
## [1] "a" "b" "d" "c"
##
## [[2]]
## [1] "e" "f" "g"