I have a data frame that contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B) consisting of 0s and 1s.
It is 0 by default, but when 5 consecutive data points are all positive OR all negative, it should be 1. This applies only to unbroken runs (e.g., when the run is positive and a negative number appears, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried converting the whole data frame to a list (and looping over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
library(zoo)
set.seed(1000)
df <- data.frame(Value = sample(-9:9, 1000, replace = TRUE))
# named sgn so that it does not mask base::sign()
sgn <- sign(df$Value)
rolling <- rollmean(sgn, k = 5, fill = 0, align = "right")
df$B <- as.numeric(abs(rolling) == 1)
I generated 1000 values, a mix of positive and negative numbers.
Extract the sign of the values: this is -1 for negative, 1 for positive and 0 for 0.
Calculate the right-aligned rolling mean over 5 values (it averages x[1:5], x[2:6], ...). The result is 1 or -1 only if all five values in the window are positive or negative (respectively).
Take the absolute value and compare it against 1. This gives a logical vector that is converted to 0s and 1s matching your conditions.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
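If installing zoo is not an option, the same idea can be sketched in base R. This is only a sketch and not part of the answer above: run_flag is a hypothetical helper name, and it uses stats::filter to take a right-aligned rolling sum of the signs (a sum rather than a mean, so the comparison stays in whole numbers). It assumes the input has at least k elements.

```r
# Hypothetical helper (not from the answer above): flags positions where the
# current value and the k - 1 values before it all share the same nonzero sign.
run_flag <- function(x, k = 5) {
  s <- sign(x)
  # sides = 1 uses the current value and the k - 1 values before it; the
  # first k - 1 results are NA because the window is incomplete there.
  roll <- as.numeric(stats::filter(s, rep(1, k), sides = 1))
  as.integer(!is.na(roll) & abs(roll) == k)
}

run_flag(c(1, 2, 1, 2, 2, 3, 4, -1, 3))
# [1] 0 0 0 0 1 1 1 0 0
```

A window containing a zero can never sum to k or -k, so zeros break a run here just as they do with the rolling mean.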
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)){
  if (length(unique(sign(dat$Value[(x-4):x])))==1){
    dat$new_col[x] <- 1
  } else {
    dat$new_col[x] <- 0
  }
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x){
as.integer(seq_along(x) > 4)})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
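To see what the grouping idiom produces, it may help to print the intermediate vector for the nine example rows. The sketch below also shows an rle()-based variant that I believe gives the same flags; the names grp, pos_in_run and B3 are made up for illustration and do not appear in the answer above.

```r
d <- data.frame(Value = c(1, 2, 1, 2, 2, 3, 4, -1, 3))

# A new group starts wherever the sign differs from the previous row.
grp <- cumsum(c(0, diff(sign(d$Value)) != 0))
grp
# [1] 0 0 0 0 0 0 0 1 2

# Variant with rle(): sequence() restarts the count at 1 in every run.
r <- rle(sign(d$Value))
pos_in_run <- sequence(r$lengths)
pos_in_run
# [1] 1 2 3 4 5 6 7 1 1

# Flag the fifth and later positions of each run.
d$B3 <- as.integer(pos_in_run > 4)
```

Both variants treat a run of zeros as a run like any other, so five zeros in a row would also be flagged.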
I want to generate a matrix (4 rows and 30 columns) in R with random elements between 0 and 1, such that each row sums to 1.
Here's a solution based on the softmax (multinomial logit) transform.
m <- matrix(rnorm(4 * 30), nrow=4)
prob <- exp(m)/rowSums(exp(m))
rowSums(prob)
#[1] 1 1 1 1
all(prob > 0 & prob < 1)
#[1] TRUE
If you pick n numbers in [0,1] which sum to 1 you are in effect picking n-1 breakpoints. You can pick the breakpoints and then work backwards to the numbers:
rand.sum <- function(n){
x <- sort(runif(n-1))
c(x,1) - c(0,x)
}
And then
t(replicate(4,rand.sum(30)))
will be a 4x30 matrix of random numbers where each row sums to 1.
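A quick sanity check of the breakpoint approach might look like the following. The seed is arbitrary and only there for reproducibility; rand.sum is repeated so that the snippet runs on its own.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

rand.sum <- function(n) {
  x <- sort(runif(n - 1))   # n - 1 breakpoints in [0, 1]
  c(x, 1) - c(0, x)         # gaps between consecutive breakpoints
}

m <- t(replicate(4, rand.sum(30)))
dim(m)                                     # 4 rows, 30 columns
isTRUE(all.equal(rowSums(m), rep(1, 4)))   # rows sum to 1 up to rounding error
all(m >= 0 & m <= 1)                       # every entry lies in [0, 1]
```

The row sums are compared with all.equal rather than == because the gaps are floating-point differences and may not add up to exactly 1.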
I have a vector that contains a sequence of 1s and 0s. Suppose it has length 166 and is
y <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1, 1,1,1,1,1,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,0,0,0,0,0,1,1,1,1)
Now I want to extract the LONGEST POSSIBLE sub-vector from the vector above such that it satisfies two properties:
(1) The sub-vector should start and end with 1.
(2) Zeros may make up at most 5% of the sub-vector's total length.
I started with the rle function, which gives the length of each run of 1s and 0s.
So it will be like
z <- rle(y)
d <- data.frame(z$values, z$lengths)
colnames(d) <- c("value", "length")
It gives me
> d
value length
1 1 22
2 0 1
3 1 13
4 0 1
5 1 2
6 0 1
7 1 1
8 0 1
9 1 1
10 0 5
11 1 1
12 0 3
13 1 2
14 0 1
15 1 1
16 0 1
17 1 74
18 0 2
19 1 17
20 0 1
21 1 2
22 0 1
23 1 3
24 0 5
25 1 4
In this case 74 + 2 + 17 + 1 + 2 + 1 + 3 = 100 is the required sub-sequence, as it contains 2 + 1 + 1 = 4 zeros, which is less than 5% of 100. If we move forward, the sequence becomes 100 + 5 + 4 = 109 and the zeros become 4 + 5 = 9, which is more than 5% of 109.
I think you are very close by computing the run-length encoding of this vector. All that remains is to consider all pairs of runs of 1's and to select the pair that is of the longest length and matches the "no more than 5% zeros" rule. This can be done in a fully vectorized manner using combn to compute all pairs of runs of 1's and cumsum to get lengths of runs from the rle output:
ones <- which(d$value == 1)
# pairs holds pairs of rows in d that correspond to runs of 1's
if (length(ones) >= 2) {
pairs <- rbind(t(combn(ones, 2)), cbind(ones, ones))
} else if (length(ones) == 1) {
pairs <- cbind(ones, ones)
}
# Taking cumulative sums of the run lengths enables vectorized computation of the lengths
# of each run in the "pairs" matrix
cs <- cumsum(d$length)
pair.length <- cs[pairs[,2]] - cs[pairs[,1]] + d$length[pairs[,1]]
cs0 <- cumsum(d$length * (d$value == 0))
pair.num0 <- cs0[pairs[,2]] - cs0[pairs[,1]]
# Multiply the length of a pair by an indicator for whether it's valid and take the max
selected <- which.max(pair.length * ((pair.num0 / pair.length) <= 0.05))
d[pairs[selected,1]:pairs[selected,2],]
# value length
# 15 1 1
# 16 0 1
# 17 1 74
# 18 0 2
# 19 1 17
# 20 0 1
# 21 1 2
# 22 0 1
# 23 1 3
We actually found a subvector that is slightly longer than the one found by the OP: it has 102 elements and five 0's (4.90%).
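One way to double-check this result is a brute-force search over every start and end position that holds a 1. The sketch below is not part of the approach above; it rebuilds y from the run lengths in the rle() table so that it runs on its own, and the names lens, zeros_upto and best are made up for illustration.

```r
# Run lengths copied from the rle() table; the values alternate 1, 0, 1, ...
lens <- c(22, 1, 13, 1, 2, 1, 1, 1, 1, 5, 1, 3, 2, 1, 1, 1,
          74, 2, 17, 1, 2, 1, 3, 5, 4)
y <- rep(rep(c(1, 0), length.out = length(lens)), times = lens)

ones <- which(y == 1)
zeros_upto <- cumsum(y == 0)   # number of zeros in y[1:i]

best <- c(start = NA, end = NA, len = 0, zeros = NA)
for (s in ones) {
  e <- ones[ones >= s]                   # candidate end positions
  len <- e - s + 1
  nz <- zeros_upto[e] - zeros_upto[s]    # zeros inside y[s:e]
  ok <- nz / len <= 0.05
  if (any(ok)) {
    i <- which(ok)[which.max(len[ok])]   # longest valid end for this start
    if (len[i] > best["len"]) {
      best <- c(start = s, end = e[i], len = len[i], zeros = nz[i])
    }
  }
}
best
# start = 56, end = 157, len = 102, zeros = 5
```

Positions 56 to 157 correspond to rows 15 through 23 of d, which agrees with the vectorized result. The search is quadratic in the number of 1s, so it is meant as a check rather than a replacement.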
I'm doing a failure analysis, for which I'd like to try some different scenarios and some random trials. So far I've done this with the mosaic package and it's working out great.
In one specific scenario I want to generate a vector of (semi)random numbers from different distributions. No problem so far.
Now I want to have a defined number of negative numbers in this vector.
For example, I want between 0 and 5 negative numbers in a vector of 25 numbers.
I thought I could use something like rbinom(n=25,prob=5/25,size=1) to get 5 random ones first, but of course 25 draws with probability 5/25 can produce more than 5 ones. This seems a dead end.
I could get it done with some for loops, but there is probably an easier way.
I've tried all sorts of sample, seq, and shuffle combinations, but I cannot get it to work so far.
Does anyone have any ideas or suggestions?
If you have a vector x where all elements are >= 0, let's say drawn from Poisson:
x = rpois(25, lambda=3)
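You can make a random 5 of them negative (a value of exactly 0 stays 0, so the elements need to be strictly positive to get exactly 5 negatives) by doing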
x * sample(rep(c(1, -1), c(length(x) - 5, 5)))
This works because
rep(c(1, -1), c(length(x) - 5, 5))
will be
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 -1 -1 -1 -1
and sample(rep(c(1, -1), c(length(x) - 5, 5))) simply shuffles them up randomly:
sample(rep(c(1, -1), c(length(x) - 5, 5)))
# [1] 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1 -1 -1 1 1 1 -1 1 1 1 1
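The question asks for between 0 and 5 negatives rather than exactly 5. One possible extension is to draw the count first and then shuffle the signs as above. This is a sketch under assumptions: flip_random is a hypothetical helper, the count is drawn uniformly from 0 to max_neg (the question does not say how it should be distributed), and 1 is added to the Poisson draws so that no value is 0.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: flips the sign of k randomly chosen elements,
# where k is itself drawn uniformly from 0..max_neg (an assumption).
flip_random <- function(x, max_neg = 5) {
  k <- sample(0:max_neg, 1)
  signs <- sample(rep(c(1, -1), c(length(x) - k, k)))
  x * signs
}

x <- rpois(25, lambda = 3) + 1   # + 1 keeps every value strictly positive
res <- flip_random(x)
sum(res < 0)                     # somewhere between 0 and 5
```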
I can suggest a very straightforward solution that guarantees 5 negative values and works for any continuous distribution. The idea is to sort the vector and subtract each value from the 6th largest, so that exactly the 5 values above it come out negative:
x <- rnorm(25)
res <- sort(x, decreasing = TRUE)[6] - x
#### [1] 0.4956991 1.5799885 2.4207497 1.1639569 0.2161187 0.2443917 -0.4942884 -0.2627706 1.5188197
#### [10] 0.0000000 1.6081025 1.4922573 1.4828059 0.3320079 0.3552913 -0.6435770 -0.3106201 1.5074491
#### [19] 0.6042724 0.3707655 -0.2624150 1.1671077 2.4679686 1.0024573 0.2453597
sum(res<0)
#### [1] 5
It also works for discrete distributions, but only if there are no ties.