rep function in R function - r

By trial and error I found that for random.sum(5) to give the right answer, rep(100, 10) has to be rep(100, 5), but I do not understand why:
# clear the workspace
rm(list=ls())
random.sum <- function(n) {
x[1:n] <- ceiling(10*runif(n))
cat("x:", x[1:n] ,"\n")
return(sum(x))
}
set.seed(3585)
x <- rep(100,10)
show(random.sum(10))
x: 9 4 10 1 9 8 4 1 3 2
## [1] 51
show(random.sum(5))
x: 9 6 6 2 2
## [1] 525

It's because x[1:n] <- ... does not create a new variable x in your function. R takes a copy of the x in the enclosing (here, global) environment and overwrites only its first n elements. So with n = 5, sum(x) adds 10 elements, the final five of which still have the value 100.
To fix this, don't assign to a slice of x; assign the result of ceiling to a variable of any name, even x:
random.sum <- function(n) {
x <- ceiling(10*runif(n))
cat("x:", x[1:n] ,"\n")
return(sum(x))
}
set.seed(3585)
random.sum(10)
## x: 9 4 10 1 9 8 4 1 3 2
## [1] 51
random.sum(5)
## x: 9 6 6 2 2
## [1] 25
Note that the difference between the two results (525 - 25) is 500: the sum of the final five elements of the global x, each of which is 100.
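The copy-on-assignment behaviour can be checked directly. The following is a minimal sketch (the function name zero_head is made up for illustration): assigning to a slice inside a function leaves the global x untouched, while the local copy keeps the old trailing values.

```r
x <- rep(100, 10)

# Hypothetical helper: assigns to a slice of x inside the function
zero_head <- function(n) {
  x[1:n] <- 0   # R copies the global x into the function, then changes the copy
  sum(x)        # sums all 10 elements of the local copy
}

local_sum <- zero_head(5)
local_sum                     # 500: five zeros plus the five untouched 100s
identical(x, rep(100, 10))    # TRUE: the global x was never modified
```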

Related

R how to find a series of common values in a vector (identifying growing season)

I'm looking for a way to identify a growing season which consists of a number of days greater than say 60 between the last frost day of spring and the first frost day in the fall. A general version of this problem is this. If I have a vector of numbers like testVec, I want the item numbers of the beginning and end range of values where the number of items is 5 or greater and all of them are greater than 0.
testVec <- c(1,3,4,0, 1, -5, 6, 0, 1,3,4,6,7,5,9, 0)
In this example, the relevant range is 1,3,4,6,7,5,9 which is testVec[9] to testVec[15]
One option could be:
testVec[with(rle(testVec > 0), rep(lengths * values >= 5, lengths))]
[1] 1 3 4 6 7 5 9
Here, the idea is to, first, create runs of values that are smaller or equal to zero and bigger than zero. Second, it checks whether the runs of values bigger than zero are of length 5 or more. Finally, it subsets the original vector for the runs of values bigger than zero with length 5 or more.
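To make those three steps concrete, here is a small sketch that prints the intermediate values for the example vector. The variable names r, long_run and keep are only illustrative.

```r
testVec <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)

# Step 1: runs of "greater than zero" / "not greater than zero"
r <- rle(testVec > 0)
r$lengths   # 3 1 1 1 1 1 7 1
r$values    # TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE

# Step 2: lengths * values is 0 for FALSE runs, so only TRUE runs can reach 5
long_run <- r$lengths * r$values >= 5

# Step 3: expand back to one flag per element and look up the positions
keep <- rep(long_run, r$lengths)
which(keep)   # 9 10 11 12 13 14 15
```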
1) rleid This approach handles any number of sequences, including zero. rleid(ok), from the data.table package, is a vector the same length as ok such that the first run of identical elements is replaced with 1, the second run with 2 and so on. The result of getSeq is a list of vectors where each vector has its positions in the original input as its names.
library(data.table)
getSeq <- function(x) {
names(x) <- seq_along(x)
ok <- x > 0
s <- split(x[ok], rleid(ok)[ok])
unname(s)[lengths(s) >= 5]
}
getSeq(testVec)
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
getSeq(numeric(16))
## list()
getSeq(c(testVec, 10 * testVec))
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
##
## [[2]]
## 25 26 27 28 29 30 31
## 10 30 40 60 70 50 90
If a data frame were desired then following gives the values and which sequence the row came from. The row names indicate the positions in the original input.
gs <- getSeq(c(testVec, 10 * testVec))
names(gs) <- seq_along(gs)
if (length(gs)) stack(gs) else gs
## values ind
## 9 1 1
## 10 3 1
## 11 4 1
## 12 6 1
## 13 7 1
## 14 5 1
## 15 9 1
## 25 10 2
## 26 30 2
## 27 40 2
## 28 60 2
## 29 70 2
## 30 50 2
## 31 90 2
2) gregexpr Replace each element that is > 0 with 1 and every other element with 0, and paste the 0's and 1's into a single character string. Then use gregexpr to look for non-overlapping sequences of 1's that are at least 5 long; it returns their start positions, g, and their lengths, attr(g, "match.length"). The helper function vals extracts from x the values at the positions of the ith such sequence, so the result is a list whose ith component is the ith sequence. The names in each output vector are its positions in the input.
getSeq2 <- function(x) {
g <- gregexpr("1{5,}", paste(+(x > 0), collapse = ""))[[1]]
vals <- function(i) {
ix <- seq(g[i], length = attr(g, "match.length")[i])
setNames(x[ix], ix)
}
if (length(g) == 1 && g == -1) list() else lapply(seq_along(g), vals)
}
getSeq2(testVec)
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
The above handles any number of sequences including 0 but if we knew there were exactly one sequence (which is the case for the example in the question) then it could be simplified to the following where the return value is just that vector:
g <- gregexpr("1{5,}", paste(+(testVec > 0), collapse = ""))[[1]]
ix <- seq(g, length = attr(g, "match.length"))
setNames(testVec[ix], ix)
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
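For readers who want to see what the regular expression actually matches, this sketch prints the intermediate string and the raw gregexpr result for the example vector (the variable name s is only illustrative).

```r
testVec <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)

# One character per element: "1" where the value is > 0, "0" otherwise
s <- paste(+(testVec > 0), collapse = "")
s                          # "1110101011111110"

# Start position and length of every run of five or more 1's
g <- gregexpr("1{5,}", s)[[1]]
as.integer(g)              # 9
attr(g, "match.length")    # 7
```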
You could "fix" @tmfmnk's solution so that it returns the start and end positions, like this:
f1 <- function(x, threshold, n) {
range(which(with(rle(x > threshold), rep(lengths * values >= n, lengths))))
}
x <- c(1, 3, 4, 0, 1, -5, 6, 0, 1,3,4,6,7,5,9, 0)
f1(x, 0, 5)
#[1] 9 15
But that does not work well when there are multiple runs, because range() then spans from the start of the first run to the end of the last one:
xx <- c(x, x)
f1(xx, 0, 5)
#[1] 9 31
Here is another, not so concise approach that returns the start and end of the longest run (the first one if there are ties).
f2 <- function(x, threshold, n) {
y <- x > threshold
y[is.na(y)] <- FALSE
a <- ave(y, cumsum(!y), FUN=cumsum)
m <- max(a)
if (m < n) return (c(NA, NA))
i <- which(a == m)[1]
c(i-m+1, i)
}
f2(x, 0, 5)
#[1] 9 15
f2(xx, 0, 5)
#[1] 9 15
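or with rle (counting only the runs that are above the threshold)
f3 <- function(x, threshold, n) {
y <- x > threshold
y[is.na(y)] <- FALSE
r <- rle(y)
len <- r$lengths * r$values # length of each TRUE run; 0 for FALSE runs
m <- max(len)
if (m < n) return (c(NA, NA))
i <- sum(r$lengths[1:which.max(len)])
c(i-m+1, i)
}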
f3(x, 0, 5)
#[1] 9 15
f3(xx, 0, 5)
#[1] 9 15
If you want the first run of at least n values, even if a later run is longer, you could do
f4 <- function(x, threshold, n) {
r <- rle(x > threshold)
k <- which(r$values & r$lengths >= n)[1] # first qualifying run
if (is.na(k)) return (c(NA, NA)) # no run is long enough
j <- cumsum(r$lengths)[k] # position where that run ends
c(j - r$lengths[k] + 1, j)
}
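The functions above each report a single run. If every qualifying run is wanted (for example, one growing season per year), a possible variant is sketched below. The name all_runs is hypothetical and not taken from any of the answers; it returns one row per run.

```r
# Hypothetical helper: start and end positions of every run longer than n - 1
all_runs <- function(x, threshold, n) {
  r <- rle(!is.na(x) & x > threshold)   # NA counts as "not above threshold"
  end <- cumsum(r$lengths)              # last position of each run
  start <- end - r$lengths + 1          # first position of each run
  keep <- r$values & r$lengths >= n     # TRUE runs that are long enough
  cbind(start = start[keep], end = end[keep])
}

x <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)
all_runs(c(x, x), 0, 5)
#      start end
# [1,]     9  15
# [2,]    25  31
```

When no run qualifies, the result is a matrix with zero rows, which can be tested with nrow().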

Create N random integers with no gaps

For a clustering algorithm that I'm implementing, I would like to initialize the cluster assignments at random. However, I need the labels to have no gaps. That is, this is not ok:
set.seed(2)
K <- 10 # initial number of clusters
N <- 20 # number of data points
z_init <- sample(K,N, replace=TRUE) # initial assignments
z_init
# [1] 2 8 6 2 10 10 2 9 5 6 6 3 8 2 5 9 10 3 5 1
sort(unique(z_init))
# [1] 1 2 3 5 6 8 9 10
where labels 4 and 7 have not been used.
Instead, I would like this vector to be:
# [1] 2 6 5 2 8 8 2 7 4 5 5 3 6 2 4 7 8 3 4 1
where the label 5 has become 4 and so forth to fill the lower empty labels.
More examples:
The vector 1 2 3 5 6 8 should be 1 2 3 4 5 6
The vector 15,5,7,7,10 should be 1 2 3 3 4
Can it be done avoiding for loops? I don't need it to be fast, I prefer it to be elegant and short, since I'm doing it only once in the code (for label initialization).
My solution using a for loop
z_init <- c(3,2,1,3,3,7,9)
idx <- order(z_init)
for (i in 2:length(z_init)){
if(z_init[idx[i]] > z_init[idx[i-1]]){
z_init[idx[i]] <- z_init[idx[i-1]]+1
}
else{
z_init[idx[i]] <- z_init[idx[i-1]]
}
}
z_init
# 3 2 1 3 3 4 5
Edit: @GregSnow came up with the current shortest answer. I'm 100% convinced that this is the shortest possible way.
For fun, I decided to golf the code, i.e. write it as short as possible:
z <- c(3, 8, 4, 4, 8, 2, 3, 9, 5, 1, 4)
# solution by hand: 1 2 3 3 4 4 4 5 6 6 7
sort(c(factor(z))) # 18 bytes, as proposed by @GregSnow in the comments
# [1] 1 2 3 3 4 4 4 5 6 6 7
# (note: as far as I know, from R 4.1.0 on c() applied to a factor returns a factor, so this variant no longer yields the integer codes)
Some other (functioning) attempts:
y=table(z);rep(seq(y),y) # 24 bytes
sort(unclass(factor(z))) # 24 bytes, based on @GregSnow's answer
diffinv(diff(sort(z))>0)+1 # 26 bytes
sort(as.numeric(factor(z))) # 27 bytes, @GregSnow's original answer
rep(seq(unique(z)),table(z)) # 28 bytes
cumsum(c(1,diff(sort(z))>0)) # 28 bytes
y=rle(sort(z))$l;rep(seq(y),y) # 30 bytes
Edit2: Just to show that byte count isn't everything:
z <- sample(1:10,10000,replace=T)
Unit: microseconds
expr                                        min         lq       mean     median         uq       max  neval
sort(c(factor(z)))                     2550.128  2572.2340  2681.4950  2646.6460  2729.7425  3140.288    100
{ y = table(z) rep(seq(y), y) }        2436.438  2485.3885  2580.9861  2556.4440  2618.4215  3070.812    100
sort(unclass(factor(z)))               2535.127  2578.9450  2654.7463  2623.9470  2708.6230  3167.922    100
diffinv(diff(sort(z)) > 0) + 1          551.871   572.2000   628.6268   626.0845   666.3495   940.311    100
sort(as.numeric(factor(z)))            2603.814  2672.3050  2762.2030  2717.5050  2790.7320  3558.336    100
rep(seq(unique(z)), table(z))          2541.049  2586.0505  2733.5200  2674.0815  2760.7305  5765.815    100
cumsum(c(1, diff(sort(z)) > 0))         530.159   545.5545   602.1348   592.3325   632.0060   844.385    100
{ y = rle(sort(z))$l rep(seq(y), y) }   661.218   684.3115   727.4502   724.1820   758.3280   857.412    100
z <- sample(1:100000,replace=T)
Unit: milliseconds
expr                                         min         lq      mean     median        uq        max  neval
sort(c(factor(z)))                     84.501189  87.227377  92.13182  89.733291  94.16700  150.08327    100
{ y = table(z) rep(seq(y), y) }        78.951701  82.102845  85.54975  83.935108  87.70365  106.05766    100
sort(unclass(factor(z)))               84.958711  87.273366  90.84612  89.317415  91.85155  121.99082    100
diffinv(diff(sort(z)) > 0) + 1          9.784041   9.963853  10.37807  10.090965  10.34381   17.26034    100
sort(as.numeric(factor(z)))            85.917969  88.660145  93.42664  91.542263  95.53720  118.44512    100
rep(seq(unique(z)), table(z))          86.568528  88.300325  93.01369  90.577281  94.74137  118.03852    100
cumsum(c(1, diff(sort(z)) > 0))         9.680615   9.834175  10.11518   9.963261  10.16735   14.40427    100
{ y = rle(sort(z))$l rep(seq(y), y) }  12.842614  13.033085  14.73063  13.294019  13.66371  133.16243    100
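The call that produced these timings is not shown. The output format looks like that of the microbenchmark package, so a sketch along the following lines should produce comparable tables. That package choice is an assumption, the timing part only runs if the package is installed, and only two of the candidates are included here.

```r
set.seed(1)
z <- sample(1:10, 10000, replace = TRUE)

# Check that two of the candidates agree before timing them
a <- cumsum(c(1, diff(sort(z)) > 0))
b <- sort(as.numeric(factor(z)))
same <- all(a == b)
same   # TRUE

# Assumption: microbenchmark is the package behind the tables above
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    cumsum(c(1, diff(sort(z)) > 0)),
    sort(as.numeric(factor(z))),
    times = 100
  ))
}
```

Timings will differ from machine to machine, so only the relative ordering is worth comparing.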
It seems to me that you are trying to randomly assign elements of a set (the numbers 1 to 20) to clusters, subject to the requirement that each cluster be assigned at least one element.
One approach that I could think of would be to select a random reward r_ij for assigning element i to cluster j. Then I would define binary decision variables x_ij that indicate whether element i is assigned to cluster j. Finally, I would use mixed integer optimization to select the assignment from elements to clusters that maximizes the collected reward subject to the following conditions:
Every element is assigned to exactly one cluster
Every cluster has at least one element assigned to it
This is equivalent to randomly selecting an assignment, keeping it if all clusters have at least one element, and otherwise discarding it and trying again until you get a valid random assignment.
In terms of implementation, this is pretty easy to accomplish in R using the lpSolve package:
library(lpSolve)
N <- 20
K <- 10
set.seed(144)
r <- matrix(rnorm(N*K), N, K)
mod <- lp(direction = "max",
objective.in = as.vector(r),
const.mat = rbind(t(sapply(1:K, function(j) rep((1:K == j) * 1, each=N))),
t(sapply(1:N, function(i) rep((1:N == i) * 1, K)))),
const.dir = c(rep(">=", K), rep("=", N)),
const.rhs = rep(1, N+K),
all.bin = TRUE)
(assignments <- apply(matrix(mod$solution, nrow=N), 1, function(x) which(x > 0.999)))
# [1] 6 5 3 3 5 6 6 9 2 1 3 4 7 6 10 2 10 6 6 8
sort(unique(assignments))
# [1] 1 2 3 4 5 6 7 8 9 10
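The accept/reject procedure described above can also be written directly, without an optimizer. This is only a sketch under the same N and K; the seed is arbitrary, and no particular assignment is claimed as output.

```r
set.seed(144)
N <- 20   # number of data points
K <- 10   # number of clusters

# Draw assignments until every cluster label 1..K appears at least once
repeat {
  z <- sample(K, N, replace = TRUE)
  if (length(unique(z)) == K) break
}

sort(unique(z))   # 1 2 3 4 5 6 7 8 9 10
```

Note that this keeps exactly K clusters, whereas the question relabels whatever clusters happen to be drawn. The loop can become slow when K is close to N, because valid draws are then rare.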
You could do it like this:
un <- sort(unique(z_init))
(z <- unname(setNames(1:length(un), un)[as.character(z_init)]))
# [1] 2 6 5 2 8 8 2 7 4 5 5 3 6 2 4 7 8 3 4 1
sort(unique(z))
# [1] 1 2 3 4 5 6 7 8
Here I replace elements of un in z_init with corresponding elements of 1:length(un).
A simple (but possibly inefficient) approach is to convert to a factor then back to numeric. Creating the factor will code the information as integers from 1 to the number of unique values, then add labels with the original values. Converting to numeric then drops the labels and leaves the numbers:
> x <- c(1,2,3,5,6,8)
> (x2 <- as.numeric(factor(x)))
[1] 1 2 3 4 5 6
>
> xx <- c(15,5,7,7,10)
> (xx2 <- as.numeric(factor(xx)))
[1] 4 1 2 2 3
> (xx3 <- as.numeric(factor(xx, levels=unique(xx))))
[1] 1 2 3 3 4
The levels = portion in the last example sets the numbers to match the order in which they appear in the original vector.
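Both orderings shown above (by sorted value and by first appearance) can also be obtained with match(), which returns the position of each element in a lookup vector. This is offered as an alternative sketch, not as what any of the answers used.

```r
z_init <- c(2, 8, 6, 2, 10, 10, 2, 9, 5, 6, 6, 3, 8, 2, 5, 9, 10, 3, 5, 1)

# Labels ranked by value: position of each element among the sorted unique values
by_value <- match(z_init, sort(unique(z_init)))
by_value
# [1] 2 6 5 2 8 8 2 7 4 5 5 3 6 2 4 7 8 3 4 1

xx <- c(15, 5, 7, 7, 10)
match(xx, sort(unique(xx)))   # 4 1 2 2 3  (same as as.numeric(factor(xx)))
match(xx, unique(xx))         # 1 2 3 3 4  (labels in order of first appearance)
```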

Abbreviate a numeric vector when displaying it in the console

I'd like to abbreviate a numeric vector when displaying it in the R console. I have already tried the function ?abbreviate, but it does not do what I want. I want the whole vector to be abbreviated, not each element of the vector. In addition, I want ... to appear at the breaking position to indicate that the vector goes on. How can I do this?
x = 1:100
x 1, 2, 3, 4, 5, 6 ...
Try str().
x <- 1:100
str(x, vec.len = 2.5, give.head = FALSE)
# 1 2 3 4 5 6 ...
But David Arenburg makes a good suggestion with cat(). Here's a function which allows you to adjust the length more precisely.
f <- function(x, n) cat(x[1:n], "...")
f(x, 5)
# 1 2 3 4 5 ...
f(x, 9)
# 1 2 3 4 5 6 7 8 9 ...
Update: In response to your comment about putting the text name of the input before the values, you can adjust the function as follows.
f <- function(x, n) {
cat(substitute(x), head(x, n), if(n < length(x)) "...")
}
stuff <- 1:100
f(stuff, 6)
# stuff 1 2 3 4 5 6 ...
f(stuff, 12)
# stuff 1 2 3 4 5 6 7 8 9 10 11 12 ...
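If the comma-separated layout from the question is wanted, a variant could look like the sketch below. The function name abbrev and its default n = 6 are assumptions for illustration. It uses deparse(substitute(x)) to get the variable name as a string and also returns the text invisibly so it can be reused.

```r
# Hypothetical helper: prints "name v1, v2, ... " with a trailing "..." if truncated
abbrev <- function(x, n = 6) {
  label <- deparse(substitute(x))                  # name of the argument as text
  shown <- paste(head(x, n), collapse = ", ")      # first n values, comma-separated
  more <- if (n < length(x)) " ..." else ""        # marker only when values were cut
  out <- paste0(label, " ", shown, more)
  cat(out, "\n", sep = "")
  invisible(out)
}

x <- 1:100
abbrev(x)
# x 1, 2, 3, 4, 5, 6 ...
```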

Recursively apply function to list elements

Using the rmatio package, I get nested lists similar to the following:
nestedlist <- list(
a = list( a = list(1:10), b = list(35)),
b = list(11:25)
)
Ideally, I want it to look like this (all lists with a single unnamed element replaced by the element):
nestedlist <- list(a = list(a=1:10, b=35), b = 11:25)
I have already tried the following:
unlist(nestedlist) # returns one vector with all elements
selective_unlist <- function(e)
if(is.list(e) &&is.null(names(e))) unlist(e) else e
# only calls the function with each leaf, so nothing gets replaced
rapply(nestedlist, how='replace', selective_unlist)
# works, but only for 2 levels
lapply(nestedlist, selective_unlist)
# works, but using explicit recursion is slow for large datasets
recursive_selective_unlist <- function(e)
if(is.list(e)) {
if(is.null(names(e))) unlist(e)
else lapply(e, recursive_selective_unlist)
} else e
Is there a better/faster way to simplify these nested lists or is the recursive function my best bet?
Following @Pafnucy's idea, I'd use
ff <- function(x) if (is.list(x[[1]])) lapply(x,ff) else unlist(x)
which does
ff(nestedlist)
# $a
# $a$a
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $a$b
# [1] 35
#
#
# $b
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# check result:
identical(list(a = list(a=1:10, b=35), b = 11:25),ff(nestedlist))
# [1] TRUE
Handling arbitrary depth of nesting:
f <- function(x) {
if (is.list(x)) unname(c(sapply(unlist(x), f))) else x
}
# sample data
nl2 <- list(a = list(a = list(1:5), b = list(1:5)))
nl3 <- list(p = nl2, q = c(9,9,9))
Intermediate output:
> f(nl2)
[1] 1 2 3 4 5 1 2 3 4 5
> f(nl3)
[1] 1 2 3 4 5 1 2 3 4 5 9 9 9
Adding a last step, since f flattens everything and we want a list of depth 1:
unstackList <- function(x) lapply(x, f)
unstackList(nl3)
Output:
$p
[1] 1 2 3 4 5 1 2 3 4 5
$q
[1] 9 9 9
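unstackList flattens each top-level element completely, so inner names are lost. If the named structure should be kept at every depth and only the unnamed single-element wrappers removed, a recursive sketch like the following is one option. The name simplify_nested is made up here; it still uses explicit recursion, so it is not a speed improvement.

```r
# Hypothetical helper: unwrap unnamed one-element lists at any depth
simplify_nested <- function(e) {
  if (!is.list(e)) return(e)                    # leaf: keep as is
  if (length(e) == 1 && is.null(names(e))) {
    return(simplify_nested(e[[1]]))             # unnamed wrapper: drop it
  }
  lapply(e, simplify_nested)                    # named list: recurse, keep names
}

nestedlist <- list(
  a = list(a = list(1:10), b = list(35)),
  b = list(11:25)
)
target <- list(a = list(a = 1:10, b = 35), b = 11:25)
identical(simplify_nested(nestedlist), target)   # TRUE
```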

storing results of a for function in list or

add <- c( 2,3,4)
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
print(z)
}
# Result
[1] 13
[1] 15
[1] 17
R prints each result, but I want to save the results for further computation in a vector, data frame or list.
Thanks in advance
Try something like:
add <- c(2, 3, 4)
z <- rep(0, length(add)) # preallocate the result vector
idx <- 1 # position of the next result
for(i in add) {
a <- i + 3
b <- a + 3
z[idx] <- a + b
idx <- idx + 1
}
print(z)
This is simple algebra, so there is no need for a for loop at all:
res <- (add + 3)*2 + 3
res
## [1] 13 15 17
Or if you want a data.frame
data.frame(a = add + 3, b = add + 6, c = (add + 3)*2 + 3)
# a b c
# 1 5 8 13
# 2 6 9 15
# 3 7 10 17
Though in general, when you are trying to do something like that, it is better to create a function, for example
myfunc <- function(x) {
a <- x + 3
b <- a + 3
z <- a + b
z
}
myfunc(add)
## [1] 13 15 17
In cases when a loop is actually needed (unlike in your example) and you want to store its results, it is better to use the *apply family for such tasks. For example, use lapply if you want a list back
res <- lapply(add, myfunc)
res
# [[1]]
# [1] 13
#
# [[2]]
# [1] 15
#
# [[3]]
# [1] 17
Or use sapply if you want a vector back
res <- sapply(add, myfunc)
res
## [1] 13 15 17
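A related option from the same family is vapply, which works like sapply but requires the expected type and length of each result to be declared, so a surprise in the return value raises an error instead of silently changing the output shape. A short sketch using the same myfunc:

```r
add <- c(2, 3, 4)

myfunc <- function(x) {
  a <- x + 3
  b <- a + 3
  a + b
}

# numeric(1) declares that every call must return exactly one number
res <- vapply(add, myfunc, numeric(1))
res
# [1] 13 15 17
```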
For a data.frame to keep all the info
add <- c( 2,3,4)
results <- data.frame()
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
#print(z)
results <- rbind(results, cbind(a,b,z))
}
results
a b z
1 5 8 13
2 6 9 15
3 7 10 17
If you just want z then use a vector, no need for lists
add <- c( 2,3,4)
results <- vector()
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
#print(z)
results <- c(results, z)
}
results
[1] 13 15 17
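Both loops above grow their result on every iteration with rbind or c, which is fine for three values but copies the whole object each time. A common alternative, sketched here with illustrative names (rows, k), is to preallocate a list and combine it once at the end.

```r
add <- c(2, 3, 4)

# One slot per iteration, filled in place
rows <- vector("list", length(add))
for (k in seq_along(add)) {
  a <- add[k] + 3
  b <- a + 3
  rows[[k]] <- data.frame(a = a, b = b, z = a + b)
}

# Bind all one-row data frames together in a single step
results <- do.call(rbind, rows)
results
#   a  b  z
# 1 5  8 13
# 2 6  9 15
# 3 7 10 17
```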
It might be instructive to compare these two results with those of @dugar:
> sapply(add, function(x) c(a=x+3, b=a+3, z=a+b) )
[,1] [,2] [,3]
a 5 6 7
b 10 10 10
z 17 17 17
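That is not lazy evaluation at work. Inside c(a=..., b=..., z=...) the names a, b and z are only element labels, not assignments, so a+3 and a+b pick up the a and b left in the global environment by the earlier for loop (7 and 10). This sometimes trips us up when computing with intermediate values. This next one should give the expected result: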
> sapply(add, function(x) c(a=x+3, b=(x+3)+3, z=(x+3)+((x+3)+3)) )
[,1] [,2] [,3]
a 5 6 7
b 8 9 10
z 13 15 17
Those results are the transpose of @dugar's. Using sapply or lapply often saves you the effort of setting up an empty starting object and then incrementing counters.
> lapply(add, function(x) c(a=x+3, b=(x+3)+3, z=(x+3)+((x+3)+3)) )
[[1]]
a b z
5 8 13
[[2]]
a b z
6 9 15
[[3]]
a b z
7 10 17
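The constant b and z rows in the first sapply call above come from variable lookup: inside c(a = ..., b = ...), the a = part only names an element and does not create a variable. A small sketch with made-up function names shows the difference between labels and real local assignments.

```r
a <- 100   # stale value left over in the enclosing environment

# `a = x + 3` only labels the first element; `a + 3` reads the outer a
labels_only <- function(x) c(a = x + 3, b = a + 3)
labels_only(2)
#   a   b
#   5 103

# Assigning inside the function body creates real local variables
with_locals <- function(x) {
  a <- x + 3
  b <- a + 3
  c(a = a, b = b, z = a + b)
}
with_locals(2)
#  a  b  z
#  5  8 13
```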
