Simple Addition using sapply in R

I am trying to use sapply to repeatedly add 0.84, starting from -2.5, until 1.7 is reached, so that I can avoid using a for loop.
What I already have tried:
my_vector2 <- sapply(-2.5:1.7, function(x) x + 0.84)
I am expecting to see -1.66, -0.82, 0.02, 0.86, 1.7 but the output is -1.66 -0.66 0.34 1.34 2.34.
What am I missing?

seq() can do what you want:
> seq(-2.5,1.7,by = .84)[-1]
[1] -1.66 -0.82 0.02 0.86 1.70
The point of the [-1] is to throw away the first number, -2.5. With round-off error, you might need to be careful with the final number as well. Type ?seq at the prompt for additional information.
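As for what the original attempt was missing: -2.5:1.7 is a colon sequence, and the colon operator always steps by 1 regardless of the endpoints, so sapply() simply adds 0.84 once to each of those five elements:
> -2.5:1.7
[1] -2.5 -1.5 -0.5  0.5  1.5
> -2.5:1.7 + 0.84
[1] -1.66 -0.66  0.34  1.34  2.34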

Related

Fill out multi-dimensional array using Julia

I am trying to fill out a multi-dimensional array. For example, X[1] should be a vector containing all the values k*h[1] for k = 0, ..., floor(Int, 15/h[1])+1, and similarly for each h[i]. I haven't been able to solve this problem.
h = [0.01 0.02 0.04 0.08 0.1 0.2 0.5 0.8]
X = [k*h[i] for k in 0:floor(Int,15/h[i])+1 for i in 1:8]
but this raises an error.
You can get a Vector of Vectors instead of a 2D Array, as noted by @PrzemyslawSzufel. This is because X[1], X[2], X[3], etc. all have different lengths.
The real error, though, is that i is not defined. The variable i is defined in the second comprehension, for i in 1:8, but it's not accessible from the first comprehension. To solve this, wrap the first comprehension in its own pair of brackets so that it becomes a single element of the second comprehension. That way X[1] ends up as a vector containing all the values k*h[1], and so on.
Note: in Julia, h = [0.01 0.02 ...] is a Matrix{Float64}, not a Vector, so you should use h = [0.01, 0.02, ...] (with commas) for a Vector. Also, the range 0:floor(Int,15/h[i])+1 is similar to 0:15/h[i]+1, because the default step is 1 either way, except that the second range produces Float64 values.
h = [0.01, 0.02, 0.04, 0.08, 0.1, 0.2, 0.5, 0.8];
X = [[k * h[i] for k in 1:15/h[i]+1] for i in 1:8];
X[1]
1501-element Vector{Float64}:
0.01
0.02
0.03
0.04
0.05
0.06
0.07
...

Two-step cluster

I have a dataset in this format:
structure(list(id = 1:4, var1_before = c(-0.16, -0.31, -0.26,
-0.77), var2_before = c(-0.7, -1.06, -0.51, -0.81), var3_before = c(2.47,
2.97, 2.91, 3.01), var4_before = c(-1.08, -1.22, -0.92, -1.16
), var5_before = c(0.54, 0.4, 0.46, 0.79), var1_after = c(-0.43,
-0.18, -0.59, 0.64), var2_after = c(-0.69, -0.38, -1.19, -0.77
), var3_after = c(2.97, 3.15, 3.35, 1.52), var4_after = c(-1.11,
-0.99, -1.26, -0.39), var5_after = c(1.22, 0.41, 1.01, 0.24)), class = "data.frame", row.names = c(NA,
-4L))
Every id is unique.
I would like to make two clusters:
First cluster for variables: var1_before, var2_before, var3_before, var4_before, var5_before
Second cluster for variables: var1_after, var2_after, var3_after, var4_after, var5_after
I used two-step clustering in SPSS for this.
How is it possible to make it in R?
This question is quite complex; here is how I'd approach the problem, hoping to help and maybe to start a discussion about it.
Some notes:
this is how I think I would approach the problem;
I don't know two-step clustering, so I use k-means;
the code is based on your data, but you can easily generalize it: I've made it depend on your data because that's simpler to explain.
So, you create the first clustering with the before variables, then the values of the variables change (the after variables), and you want to see whether the ids end up in the same cluster.
This leads me to think that you only need the first set of clusters (for the before variables), and then to check whether the ids have moved: there is no need for a second clustering, only to see whether each id has changed from one cluster to another.
# first, build the clustering model; I'm using a simple kmeans
# (df is the data.frame posted in the question)
set.seed(1234)
model <- kmeans(df[, 2:6], 2)
# you put the clusters in the dataset
df$before_cluster <- model$cluster
Now the idea is to calculate the Euclidean distance from each id's new values (the after variables) to the centroids calculated on the before variables:
# for the first cluster
cl1 <- list()
for (i in 1:nrow(df)) {
  cl1[[i]] <- dist(rbind(df[i, 7:11], model$centers[1, ]))
}
cl1 <- do.call(rbind, cl1)
colnames(cl1) <- 'first'
# for the second cluster
cl2 <- list()
for (i in 1:nrow(df)) {
  cl2[[i]] <- dist(rbind(df[i, 7:11], model$centers[2, ]))
}
cl2 <- do.call(rbind, cl2)
colnames(cl2) <- 'second'
# put them together
df <- cbind(df, cl1, cl2)
Now the last part: you can determine whether an id has changed cluster by taking the smaller of its two distances to the centroids (the smaller distance marks its "new" cluster).
df$new_cl <- ifelse(df$first < df$second, 1,2)
df
  id var1_before var2_before var3_before var4_before var5_before var1_after var2_after var3_after var4_after var5_after before_cluster     first    second new_cl
1  1       -0.16       -0.70        2.47       -1.08        0.54      -0.43      -0.69       2.97      -1.11       1.22              2 0.6852372 0.8151840      1
2  2       -0.31       -1.06        2.97       -1.22        0.40      -0.18      -0.38       3.15      -0.99       0.41              1 0.7331098 0.5208887      2
3  3       -0.26       -0.51        2.91       -0.92        0.46      -0.59      -1.19       3.35      -1.26       1.01              2 0.6117598 1.1180004      1
4  4       -0.77       -0.81        3.01       -1.16        0.79       0.64      -0.77       1.52      -0.39       0.24              1 2.0848381 1.5994765      2
It seems they have all changed clusters.
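As a side note, the two distance loops above could be collapsed into one vectorized computation. This is just a sketch using the same df and model as above (the name dists is mine):
dists <- apply(df[, 7:11], 1, function(row)
  sqrt(colSums((t(model$centers) - row)^2)))  # 2 x nrow(df): distance to each centroid
df$new_cl <- apply(dists, 2, which.min)       # index of the nearest centroid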

How to optimize recursive function for finding all permutations?

I wrote the following piece of code to find all permutations of a given vector:
perm <- function(v, r = NULL, P = NULL) {
  l <- length(v)
  if (l == 0) {
    P <- rbind(P, r)
    rownames(P) <- NULL
    P
  } else {
    for (i in 1:l) {
      new_r <- c(r, v[i])
      new_v <- v[-i]
      P <- perm(new_v, new_r, P)
    }
    P
  }
}
P <- perm(1:9) # takes "forever" yet e.g. perm(1:7) is quite fast!?!
P
It does what it should but the problem is that it kind of runs forever if one uses vectors of length > 8 (as above).
My question
I don't really see the problem: I found some recursive implementations that don't look very different yet are much more efficient. Is there a simple way to optimize the code so that it runs faster?
As @akrun states, recursion in R is generally not that efficient. However, if you must have a recursive solution, look no further than gtools::permutations. Here is that implementation:
permGtools <- function(n, r, v) {
  if (r == 1)
    matrix(v, n, 1)
  else if (n == 1)
    matrix(v, 1, r)
  else {
    X <- NULL
    for (i in 1:n)
      X <- rbind(X, cbind(v[i], permGtools(n - 1, r - 1, v[-i])))
    X
  }
}
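For example, on a small input it produces the expected n! rows (here 3! = 6):
> permGtools(3, 3, 1:3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    3    2
[3,]    2    1    3
[4,]    2    3    1
[5,]    3    1    2
[6,]    3    2    1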
By the way, to get the full source code, simply type gtools::permutations in the console and hit enter. For more information see How can I view the source code for a function?
And here are some timings:
system.time(perm(1:8))
user system elapsed
34.074 10.641 44.815
system.time(permGtools(8,8,1:8))
user system elapsed
0.253 0.001 0.255
And just for good measure:
system.time(permGtools(9, 9, 1:9))
user system elapsed
2.512 0.046 2.567
Why is the OP's implementation slower?
Skip to the summary if you don't want to read the details.
For starters, we can simply see that the OP's implementation makes more recursive calls than the gtools implementation. To show this, we add count <<- count + 1L to the top of each function (N.B. we are using the <<- assignment operator, which searches through the parent environments first), e.g.:
permGtoolsCount <- function(n, r, v) {
  count <<- count + 1L
  if (r == 1)
  .
  .
And now we test a few lengths:
iterationsOP <- sapply(4:7, function(x) {
  count <<- 0L
  temp <- permCount(1:x)  # permCount is the OP's perm() with the same counter added
  count
})
iterationsOP
[1]    65   326  1957 13700
iterationsGtools <- sapply(4:7, function(x) {
  count <<- 0L
  temp <- permGtoolsCount(x, x, 1:x)
  count
})
iterationsGtools
[1]   41  206 1237 8660
As you can see, the OP's implementation makes more calls in every case; in fact, it makes about 1.58 times the number of recursive calls.
iterationsOP / iterationsGtools
[1] 1.585366 1.582524 1.582053 1.581986
(The ratio converges to e/(e-1) ≈ 1.582: the OP's function recurses all the way down to the empty vector, so its call count satisfies T(n) = 1 + n*T(n-1) with T(0) = 1, while permGtools stops one level earlier, with G(1) = 1.)
As we have stated already, recursion in R has a bad reputation. I couldn't find anything pinpointing exactly why this is the case, other than the fact that R does not employ tail-call optimization.
At this point, it seems hard to believe that making about 1.58 times more recursive calls would explain the 175 times speed up we saw above (i.e. 44.815 / 0.255 ~= 175).
We can profile the code with Rprof in order to glean more information:
Rprof("perm.out", memory.profiling = TRUE)
a1 <- perm(1:8)
Rprof(NULL)
summaryRprof("perm.out", memory = "both")$by.total
total.time total.pct mem.total self.time self.pct
"perm" 43.42 100.00 15172.1 0.58 1.34
"rbind" 22.50 51.82 7513.7 22.50 51.82
"rownames<-" 20.32 46.80 7388.7 20.30 46.75
"c" 0.02 0.05 23.7 0.02 0.05
"length" 0.02 0.05 0.0 0.02 0.05
Rprof("permGtools.out", memory.profiling = TRUE)
a2 <- permGtools(8, 8, 1:8)
Rprof(NULL)
summaryRprof("permGtools.out", memory = "tseries")$by.total
total.time total.pct mem.total self.time self.pct
"rbind" 0.34 100.00 134.8 0.18 52.94
"cbind" 0.34 100.00 134.8 0.08 23.53
"permGtools" 0.34 100.00 134.8 0.06 17.65
"matrix" 0.02 5.88 0.0 0.02 5.88
One thing that jumps out immediately (other than the time) is the huge memory usage of the OP's implementation: roughly 15 GB, whereas the gtools implementation uses only about 135 MB.
Digging Deeper
In the above, we are simply looking at memory usage in a general view by setting the memory parameter to both. There is another setting called tseries that lets you look at the memory usage over time.
head(summaryRprof("perm.out", memory = "tseries"))
vsize.small vsize.large nodes duplications stack:2
0.02 4050448 25558992 49908432 2048 "perm":"perm"
0.04 98808 15220400 1873760 780 "perm":"perm"
0.06 61832 12024184 1173256 489 "perm":"perm"
0.08 45400 0 861728 358 "perm":"perm"
0.1 0 14253568 0 495 "perm":"perm"
0.12 75752 21412320 1436120 599 "perm":"perm"
head(summaryRprof("permGtools.out", memory = "tseries"))
vsize.small vsize.large nodes duplications stack:2
0.02 4685464 39860824 43891512 0 "permGtools":"rbind"
0.04 542080 552384 12520256 0 "permGtools":"rbind"
0.06 0 0 0 0 "permGtools":"rbind"
0.08 767992 1200864 17740912 0 "permGtools":"rbind"
0.1 500208 566592 11561312 0 "permGtools":"rbind"
0.12 0 151488 0 0 "permGtools":"rbind"
There is a lot going on here, but the thing to focus on is the duplications field. From the documentation for summaryRprof we have:
It also records the number of calls to the internal function duplicate in the time interval. duplicate is called by C code when arguments need to be copied.
Comparing the number of copies in each implementation:
sum(summaryRprof("perm.out", memory = "tseries")$duplications)
[1] 121006
sum(summaryRprof("permGtools.out", memory = "tseries")$duplications)
[1] 0
So we see that the OP's implementation requires many copies to be made. This isn't surprising, given that the growing result is a parameter of the function itself: P, the matrix of permutations to be returned, gets larger and larger with each iteration, and with each iteration we pass it along to perm. Notice that in the gtools implementation this is not the case, as it simply has two numeric values and a vector as its parameters.
Summary
So there you have it: the OP's original implementation not only makes more recursive calls, it also requires many copies to be made, which in turn bogs down the memory and drastically hurts efficiency.
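If you want to keep the OP's recursive structure while avoiding the copying, one option is to pre-allocate the result matrix and fill it in place through a closure. This is only a sketch of the idea (the names permPrealloc and fill are mine, not from gtools):
permPrealloc <- function(v) {
  n <- length(v)
  P <- matrix(v[1], nrow = factorial(n), ncol = n)  # allocate the full result once
  row <- 0L
  fill <- function(v, r) {
    if (length(v) == 0L) {
      row <<- row + 1L
      P[row, ] <<- r  # write in place instead of growing with rbind()
    } else {
      for (i in seq_along(v)) fill(v[-i], c(r, v[i]))
    }
  }
  fill(v, v[0])
  P
}
This makes the same number of recursive calls as the OP's version, but it should remove the repeated rbind() growth and the rownames<- calls that dominated the profile above.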
It may be better to use permuteGeneral from RcppAlgos:
P <- perm(1:5) # OP's function
library(RcppAlgos)
P1 <- permuteGeneral(5, 5)
all.equal(P, P1, check.attributes = FALSE)
#[1] TRUE
Benchmarks
On a slightly longer sequence
system.time({
P2 <- permuteGeneral(8, 8)
})
#user system elapsed
# 0.001 0.000 0.001
system.time({
P20 <- perm(1:8) #OP's function
})
# user system elapsed
# 31.254 11.045 42.226
all.equal(P2, P20, check.attributes = FALSE)
#[1] TRUE
Generally, recursive functions take longer, as each recursive call adds its own execution overhead.

R dataframe with nested vector

Background
In R this works:
> df <- data.frame(a=numeric(), b=numeric())
> rbind(df, list(a=1, b=2))
a b
1 1 2
But if I want the list to have a vector, rbind fails:
> df <- data.frame(a=numeric(), b=vector(mode="numeric"))
> rbind(df, list(a=1, b=c(2,3)))
Error in rbind(deparse.level, ...) :
invalid list argument: all variables should have the same length
And if I try to specify the vector length, declaring the dataframe fails:
> df <- data.frame(a=numeric(), b=vector(mode="numeric", length=2))
Error in data.frame(a = numeric(), b = vector(mode = "numeric", length = 2)) :
arguments imply differing number of rows: 0, 2
Finally, if I eschew declaring the dataframe and try to rbind two lists directly, it looks like everything is working, but the datatypes are all wrong and none of the columns appear to exist.
> l1 <- list(a=1, b=c(2,3))
> l2 <- list(a=10, b=c(20,30))
> obj <- rbind(l1, l2)
> obj
a b
l1 1 Numeric,2
l2 10 Numeric,2
> typeof(obj)
[1] "list"
> obj$a
NULL
> obj$b
NULL
> names(obj)
NULL
My setup
I have an embedded device that gathers data every 50 ms and spits out a packet of data. In my script, I'm parsing a waveform that represents the states of that process (process previous frame and transmit, gather new data, dead time where nothing happens) with a state machine. For each packet I calculate the duration of the process period and of the data-gathering period, which is subdivided into 8 or 16 acquisition cycles (I calculate the time of each acquisition cycle), plus the remaining dead time.
My list basically looks like list(process=#, cycles=c(#,#,#,#), deadtime=#). Different packet types have different numbers of cycles, so I pass that in as a parameter, and I want the script to work on any packet type.
My question
Is there a way to declare a dataframe that does what I want, or am I using R in a fundamentally wrong way, and should I break each cycle into its own list element? I was hoping to avoid the latter, as it will make treating the cycles as a group more difficult.
I will note that I've just started learning R so I'm probably doing some odd things with it.
Expected output
If I were to process 4 packets worth of signal with 3 acq. cycles each, this would be my ideal output:
df <- data.frame(processTime=numeric(), cycles=???, deadtime=numeric())
df <- rbind(df, list(processTime=0.05, cycles=c(0.08, 0.10, 0.07), deadtime=0.38))
etc...
processTime cycles deadtime
1 0.05 0.08 0.10 0.07 0.38
2 0.06 0.07 0.11 0.09 0.36
3 0.07 0.28 0.11 0.00 0.00
4 0.06 0.08 0.08 0.09 0.41
I'll take a different stab. Dealing with just your first 2 records.
processTime<-c(.05,.06)
cycles<-list(list(.08,.10,.07), list(.07,.09,.38))
deadtime<-c(.38,.36)
For cycles, we created a list where each element is itself a list containing 3 elements. So cycles[[1]][1] refers to .08, cycles[[1]][2] refers to the second element of the first list, and cycles[[2]][3] refers to the 3rd item in the second list.
If we use cbind to bind these we get the following:
test<-as.data.frame(cbind(processTime,cycles,deadtime))
test
processTime cycles deadtime
1 0.05 0.08, 0.10, 0.07 0.38
2 0.06 0.07, 0.09, 0.38 0.36
test$cycles[[1]] will return the first list:
test$cycles[[1]]
[[1]]
[[1]][[1]]
[1] 0.08
[[1]][[2]]
[1] 0.1
[[1]][[3]]
[1] 0.07
Whereas the 3rd element of the second list can be called with:
test$cycles[[2]][3]
[[1]]
[1] 0.38
You can also unlist later for calculations:
unlist(test$cycles[[2]])
[1] 0.07 0.09 0.38
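Since the stated goal is to treat the cycles as a group, you can also compute across the list column directly, e.g. the mean cycle time per packet (just an illustration on the test object above):
sapply(test$cycles, function(x) mean(unlist(x)))
[1] 0.08333333 0.18000000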
To build this up iteratively, as you requested:
test<-data.frame()
processTime<-c(.05)
cycles<-list(list(.08,.10,.07))
deadtime<-c(.38)
test<-as.data.frame(cbind(processTime,cycles,deadtime))
test
processTime cycles deadtime
1 0.05 0.08, 0.10, 0.07 0.38
processTime<-c(.06)
cycles<-list(list(.07,.09,.38))
deadtime<-c(.36)
test<- rbind(test,as.data.frame(cbind(processTime,cycles,deadtime)))
test
processTime cycles deadtime
1 0.05 0.08, 0.10, 0.07 0.38
2 0.06 0.07, 0.09, 0.38 0.36
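As a final note, an arguably cleaner way to get the same kind of list column, avoiding the cbind()/as.data.frame() round-trip, is to build the data frame first and assign the list column directly (a sketch; test2 and the use of plain numeric vectors are my choices):
test2 <- data.frame(processTime = c(.05, .06), deadtime = c(.38, .36))
test2$cycles <- list(c(.08, .10, .07), c(.07, .09, .38))  # plain vectors, so no unlist() needed later
test2$cycles[[1]]
[1] 0.08 0.10 0.07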

Why does the histogram object get printed when hist is called in j?

library(data.table)
set.seed(120834)
DT <- data.table(var = rnorm(100))
Compare:
> hist(DT[ , var]) # of course, the plot is drawn
With
> DT[ , hist(var)]
$breaks
[1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5
$counts
[1] 1 1 6 7 16 20 20 12 12 3 2
$density
[1] 0.02 0.02 0.12 0.14 0.32 0.40 0.40 0.24 0.24 0.06 0.04
$mids
[1] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 2.25
$xname
[1] "var"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Why does the latter print the object returned by hist, but not the former? Is there a way to suppress this?
I'm seeing in the code for hist.default that the object is returned invisibly:
if (plot) {
  plot(r, freq = freq1, col = col, border = border, angle = angle,
       density = density, main = main, xlim = xlim, ylim = ylim,
       xlab = xlab, ylab = ylab, axes = axes, labels = labels,
       ...)
  invisible(r)
}
(And looking at the error from DT[ , hist(var, NA)] confirms [.data.table is indeed invoking hist.default)
So how did [.data.table manage to supersede this invisible return? Is it perhaps that the invisible object is returned to a hidden environment invisibly, but the object from that environment is then returned visibly?
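A minimal illustration of that mechanism, independent of data.table (f, g, and h are toy functions of mine): invisibility survives a direct tail call, but is lost once the value is stored and returned again, which is presumably what happens to j's result inside [.data.table.
f <- function() invisible(42)
f()  # prints nothing
g <- function() f()
g()  # still prints nothing: visibility propagates through a tail call
h <- function() { x <- f(); x }
h()  # [1] 42 -- storing the result and re-returning it makes it visible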
This is a known bug, and, as of now, there's no known solution or plans to fix it.
See GH Issue #482 for updates.
Edit from Matt -- Workarounds
You can suppress it like this :
DT[ , {hist(var);NULL}]
or
invisible(DT[,hist(var)])
Looks like we need to revisit this, as FAQ 2.18 needs an update: "2.18 I’m using j for its side effect only, but I’m still getting data returned. How do I stop that?" I have tried to do better in the past, spending several man-days on it. We are limited by R internals to some extent (as FAQ 2.22 explains).
