Generate a sequence of numbers with repeated intervals - r

I am trying to create sequences of 6 numbers at a time, with the blocks starting at intervals of 144.
Like this one, for example:
c(1:6, 144:149, 288:293)
1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
How could I automatically generate such a sequence with seq or another function?

I find the sequence function to be helpful in this case. If you had your data in a structure like this:
(info <- data.frame(start=c(1, 144, 288), len=c(6, 6, 6)))
# start len
# 1 1 6
# 2 144 6
# 3 288 6
then you could do this in one line with:
sequence(info$len) + rep(info$start-1, info$len)
# [1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
Note that this solution works even if the sequences you're combining are different lengths.
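For instance, a minimal sketch with a hypothetical info2 holding unequal lengths:
info2 <- data.frame(start=c(1, 144, 288), len=c(3, 6, 2))  # hypothetical unequal lengths
sequence(info2$len) + rep(info2$start-1, info2$len)
# [1]   1   2   3 144 145 146 147 148 149 288 289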

Here's one approach:
unlist(lapply(c(0L,(1:2)*144L-1L),`+`,seq_len(6)))
# or...
unlist(lapply(c(1L,(1:2)*144L),function(x)seq(x,x+5)))
Here's a way I like a little better:
rep(c(0L,(1:2)*144L-1L),each=6) + seq_len(6)
Generalizing...
rlen <- 6L
rgap <- 144L
rnum <- 3L
starters <- c(0L,seq_len(rnum-1L)*rgap-1L)
rep(starters, each=rlen) + seq_len(rlen)
# or...
unlist(lapply(starters+1L,function(x)seq(x,x+rlen-1L)))
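As a quick sanity check, the generalized version with smaller, hypothetical parameters gives:
rlen <- 4L   # hypothetical smaller run length
rgap <- 10L  # hypothetical smaller gap
rnum <- 3L
starters <- c(0L, seq_len(rnum-1L)*rgap-1L)
rep(starters, each=rlen) + seq_len(rlen)
# [1]  1  2  3  4 10 11 12 13 20 21 22 23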

This can also be done using seq or seq.int:
x = c(1, 144, 288)
c(sapply(x, function(y) seq.int(y, length.out = 6)))
#[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
As @Frank mentioned in the comments, here is another way to achieve this using @josilber's data structure (this is particularly useful when different intervals need sequences of different lengths):
c(with(info, mapply(seq.int, start, length.out=len)))
#[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
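For example, with a hypothetical info2 holding unequal lengths, mapply returns a list rather than a matrix, so unlist() rather than c() is used to flatten it:
info2 <- data.frame(start=c(1, 144, 288), len=c(2, 3, 4))  # hypothetical unequal lengths
unlist(with(info2, mapply(seq.int, start, length.out=len)))
# [1]   1   2 144 145 146 288 289 290 291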

From R >= 4.0.0, you can now do this in one line with sequence:
sequence(c(6,6,6), from = c(1,144,288))
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
The first argument, nvec, is the length of each sequence; the second, from, is the starting point for each sequence.
As a function, with n being the number of intervals you want:
f <- function(n) sequence(rep(6,n), from = c(1,144*1:(n-1)))
f(3)
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
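If the numbers within each block should not step by 1, sequence() also accepts a by argument (per its documentation); a small hypothetical example:
sequence(c(3, 3), from = c(1, 144), by = 2)
# [1]   1   3   5 144 146 148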

I am using R 3.3.2 on OS X 10.9.4.
I tried:
a <- c()   # stores expected sequence
f <- 288   # starting number of final sub-sequence
it <- 144  # interval
for (d in seq(0, f, by = it)) {
  if (d == 0) {
    d <- 1
  }
  a <- c(a, seq(d, d + 5))
  print(d)
}
print(a)
And the expected sequence is stored in a:
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
And another try:
a <- c()   # stores expected sequence
it <- 144  # interval
lo <- 4    # number of sub-sequences
for (d in seq(0, by = it, length.out = lo)) {
  if (d == 0) {
    d <- 1
  }
  a <- c(a, seq(d, d + 5))
  print(d)
}
print(a)
The result:
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293 432 433 434 435 436 437

I tackled this with the cumsum function:
seq_n <- 3 # number of sequences
rep(1:6, seq_n) + rep(c(0, cumsum(rep(144, seq_n-1))-1), each = 6)
# [1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
No need to calculate the starting values of the sequences as in @josilber's solution, but the length of each sequence has to be constant.

Related

Finding index of the first non-zero element starting from the bottom of a column

In a data frame dt of dimension 76x108, I want to reduce the sum of the values in each of columns 13 to 108 by an amount stored in a vector c, by reducing the non-zero elements of the column one at a time, starting from the last row.
For example, if dt[76,13] > 0, the following happens:
dt[76,13] <- max((dt[76,13]-c),0)
If after this operation dt[76,13] == 0, the residual c - dt[76,13] (computed with the original value of dt[76,13]) should be subtracted from the next non-zero element up in column 13. This goes on until the sum of all rows in column 13 has been reduced by an amount equivalent to c.
This needs to be done for the 96 columns in dt[,13:108].
Edited: Added an example with a smaller data frame below.
dt <- data.frame(Plant = sample(LETTERS, 10, replace = T),
                 Type = rep("Base", 10),
                 Ownership = rep("Pub", 10))
caps <- matrix(round(runif(10*5, 0, 500), 0), nrow = 10, ncol = 5)
dt <- as.data.frame(cbind(dt, caps))  # this is what the data frame looks like
for (i in 1:5) {
  colnames(dt)[i+3] <- paste0("TB", i)
}
dt
dt
Plant Type Ownership TB1 TB2 TB3 TB4 TB5
1 T Base Pub 454 32 162 271 478
2 S Base Pub 275 75 385 491 60
3 Y Base Pub 314 44 252 221 363
4 T Base Pub 170 122 490 332 123
5 J Base Pub 241 178 173 472 468
6 B Base Pub 243 316 152 411 434
7 T Base Pub 127 167 356 451 400
8 U Base Pub 20 102 54 182 57
9 O Base Pub 368 333 236 103 27
10 J Base Pub 343 189 0 494 184
c <- c(500,200,217,50,300)
#required output
Plant Type Ownership TB1 TB2 TB3 TB4 TB5
1 T Base Pub 454 32 162 271 478
2 S Base Pub 275 75 385 491 60
3 Y Base Pub 314 44 252 221 363
4 T Base Pub 170 122 490 332 123
5 J Base Pub 241 178 173 472 468
6 B Base Pub 243 316 152 411 434
7 T Base Pub 127 167 356 451 368
8 U Base Pub 20 102 54 182 0
9 O Base Pub 211 322 19 103 0
10 J Base Pub 0 0 0 444 0
#dt[10,4] is now max((343-500),0), while dt[9,4] is 368-(500-343).
#dt[10,5] is now max((189-200),0), while dt[9,5] is 333-(200-189).
#and so on.
What I've tried so far looks something like this:
for (i in 4:8) {            # the TB columns are 4:8; their caps are c[1:5], hence c[i-3]
  j <- nrow(dt)             # start from the last row
  if (dt[j, i] > 0) {
    res1 <- c[i-3] - dt[j, i]               # residual value of the difference
    dt[j, i] <- max(dt[j, i] - c[i-3], 0)
    while (res1 > 0) {      # continue until an amount equivalent to c[i-3] has been subtracted
      j <- j - 1
      p <- dt[j, i]
      dt[j, i] <- max(dt[j, i] - res1, 0)
      res1 <- res1 - p
    }
  } else if (dt[j, i] == 0) {  # if the last element of the column is already 0, start with the first non-zero element above it
    j <- j - 1
    res1 <- c[i-3] - dt[j, i]
    dt[j, i] <- max(dt[j, i] - c[i-3], 0)
    while (res1 > 0) {
      j <- j - 1
      p <- dt[j, i]
      dt[j, i] <- max(dt[j, i] - res1, 0)
      res1 <- res1 - p
    }
  }
}
This will do what you want. (I renamed your vector c to cc so that it doesn't clash with the base function c.)
df <- read.table(text = ' Plant Type Ownership TB1 TB2 TB3 TB4 TB5
1 T Base Pub 454 32 162 271 478
2 S Base Pub 275 75 385 491 60
3 Y Base Pub 314 44 252 221 363
4 T Base Pub 170 122 490 332 123
5 J Base Pub 241 178 173 472 468
6 B Base Pub 243 316 152 411 434
7 T Base Pub 127 167 356 451 400
8 U Base Pub 20 102 54 182 57
9 O Base Pub 368 333 236 103 27
10 J Base Pub 343 189 0 494 184', header =T)
cc <- c(500,200,217,50,300)
library(tidyverse)
df %>% arrange(rev(row_number())) %>%
mutate(across(starts_with('TB'), ~ . - c(first(pmin(cc[as.integer(str_remove(cur_column(), 'TB'))],
cumsum(pmin(., cc[as.integer(str_remove(cur_column(), 'TB'))])))),
diff(pmin(cc[as.integer(str_remove(cur_column(), 'TB'))],
cumsum(pmin(., cc[as.integer(str_remove(cur_column(), 'TB'))])))))
)) %>%
arrange(rev(row_number()))
#> Plant Type Ownership TB1 TB2 TB3 TB4 TB5
#> 1 T Base Pub 454 32 162 271 478
#> 2 S Base Pub 275 75 385 491 60
#> 3 Y Base Pub 314 44 252 221 363
#> 4 T Base Pub 170 122 490 332 123
#> 5 J Base Pub 241 178 173 472 468
#> 6 B Base Pub 243 316 152 411 434
#> 7 T Base Pub 127 167 356 451 368
#> 8 U Base Pub 20 102 54 182 0
#> 9 O Base Pub 211 322 19 103 0
#> 10 J Base Pub 0 0 0 444 0
Created on 2021-05-24 by the reprex package (v2.0.0)
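For readers who find the pipe dense, here is my own more explicit base R sketch of the same bottom-up idea (the helper name reduce_from_bottom is made up; it reuses df and cc from above):
reduce_from_bottom <- function(col, cap) {
  r    <- rev(col)                          # bottom row first
  take <- diff(c(0, pmin(cumsum(r), cap)))  # amount removed from each element
  rev(r - take)
}
dt2 <- df
dt2[paste0("TB", 1:5)] <- Map(reduce_from_bottom, df[paste0("TB", 1:5)], cc)
dt2  # matches the required output shown above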

How to find out the sequence of value in R

Suppose that I have generated 100 values from -100 to 100 and taken the cumulative sum of those values.
set.seed(123)
x <- -100:100
z <- sample (x, size = 100,replace=T)
cumsum(z)
and I got
[1] 58 136 49 143 212 161 178 120 33 50 102 91 81 177 167 251 242 278 276 247 172 78 147
[24] 183 246 223 203 145 147 163 138 180 111 119 25 61 129 102 24 78 165 117 151 103 157 222
[47] 155 123 94 69 31 71 67 57 109 46 -34 -94 -20 -31 -72 -157 -142 -149 -244 -145 -160 -175 -237
[70] -179 -162 -213 -280 -377 -465 -497 -471 -419 -468 -547 -559 -500 -576 -642 -575 -564 -635 -596 -538 -518 -509 -452
[93] -489 -448 -350 -384 -334 -313 -335 -351
Now, I would like to find the first value that is greater than 200 or lower than -200.
If I do it by hand, I can see that the 5th value (212) is the first one greater than 200.
However, is there an R command to find the first time that the cumulative sum is greater than 200 or lower than -200?
Thank you very much
A quick hack way to do this might be:
library(dplyr)                       # for if_else()
zd <- data.frame(cs = cumsum(z))     # cumulative sums as a data frame
zd$lv <- if_else(abs(zd$cs) > 200, TRUE, FALSE)
min(which(zd$lv == TRUE))
The min(which(...)) solutions provided by others don't give a convenient answer in case none of the values meet the condition. For example,
set.seed(123)
x <- -100:100
z <- sample (x, size = 100,replace=T)
min(which(abs(cumsum(z)) > 200))
#> [1] 5
min(which(abs(cumsum(z)) > 1000)) # None meet this condition
#> Warning in min(which(abs(cumsum(z)) > 1000)): no non-missing arguments to min;
#> returning Inf
#> [1] Inf
A better way is given in the R help page for which.max:
match(TRUE, abs(cumsum(z)) > 200)
#> [1] 5
match(TRUE, abs(cumsum(z)) > 1000)
#> [1] NA
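As an aside (not part of the original answers), base R's Position() behaves similarly without the intermediate logical vector, returning NA by default when nothing matches:
Position(function(v) abs(v) > 200, cumsum(z))
#> [1] 5
Position(function(v) abs(v) > 1000, cumsum(z))
#> [1] NA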

How to allocate groups of a data frame based on time in R

Hello I have a table like so:
Entry TimeOn TimeOff Alarm
1 60 70 355
2 80 85 455
3 100 150 400
4 105 120 320
5 125 130 254
6 135 155 220
7 160 170 850
I would like to understand how I can identify the entries that start during another alarm and end either during or after that alarm, such as entries 4, 5, and 6, so that they can be filtered out of the data frame.
So this would be the desired result, a data frame that looks like this:
Entry TimeOn TimeOff Alarm
1 60 70 355
2 80 85 455
3 100 150 400
7 160 170 850
So entries 4, 5, and 6 are removed.
library(dplyr)
library(data.table)
df$flag <- apply(df, 1, function(x) {
  nrow(filter(df, data.table::between(x['TimeOn'], df$TimeOn, df$TimeOff)))
})
df[df$flag > 1, ]
  Entry TimeOn TimeOff Alarm flag
4     4    105     120   320    2
5     5    125     130   254    2
6     6    135     155   220    2
These are the rows that start inside another alarm's interval (flag counts the containing intervals, including a row's own); keep the others with df[df$flag == 1, ] to get the desired output.
# Base R option
df$flag <- apply(df, 1, function(x) { nrow(df[x['TimeOn'] >= df$TimeOn & x['TimeOn'] <= df$TimeOff, ]) })
Suggested by @Andre Elrico:
df[apply(df, 1, function(x) { nrow(df[between(x[['TimeOn']], df$TimeOn, df$TimeOff), ]) > 1 }), ]
data
df <- read.table(text="
Entry TimeOn TimeOff Alarm
1 60 70 355
2 80 85 455
3 100 150 400
4 105 120 320
5 125 130 254
6 135 155 220
7 160 170 850
",header=T)

Mapping cut factor variable to numeric in R

I have a factor variable representing histogram bins, with values like '660-664', ..., '740-744', '745-749', and so on.
How can I map the factor variable to its mean value, e.g. mapping '660-664' to 662?
Basically, what I'm looking for is the inverse of the "cut" function.
You can make use of the plot = FALSE argument from hist to extract the breaks, then use that to get your midpoints:
set.seed(1)
x <- sample(300, 30)
x
# [1] 80 112 171 270 60 266 278 194 184 18 296 52 198 111 221 142
# [17] 204 281 108 219 262 290 182 35 74 107 4 105 237 93
temp <- hist(x, plot = FALSE)$breaks
temp
# [1] 0 50 100 150 200 250 300
rowMeans(cbind(head(temp, -1),
tail(temp, -1)))
# [1] 25 75 125 175 225 275
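Incidentally, hist() also returns the cell midpoints directly in its mids component, so the rowMeans step can be skipped:
hist(x, plot = FALSE)$mids
# [1]  25  75 125 175 225 275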
Update: Calculating the mean from a character string of ranges
Judging by your comments, you might be looking for something more like this:
myVec <- c("735-739", "715-719", "690-694", "695-699", "695-699",
"670-674", "720-724", "705-709", "685-689")
myVec
# [1] "735-739" "715-719" "690-694" "695-699" "695-699" "670-674"
# [7] "720-724" "705-709" "685-689"
sapply(strsplit(myVec, "-"), function(x) mean(as.numeric(x)))
# [1] 737 717 692 697 697 672 722 707 687

Summarizing a data frame

I am trying to take the following data and then use it to create a table which has the information broken down by state.
Here's the data:
> head(mydf2, 10)
lead_id buyer_account_id amount state
1 52055267 62 300 CA
2 52055267 64 264 CA
3 52055305 64 152 CA
4 52057682 62 75 NJ
5 52060519 62 750 OR
6 52060519 64 574 OR
15 52065951 64 152 TN
17 52066749 62 600 CO
18 52062751 64 167 OR
20 52071186 64 925 MN
I've already subset to the states that I'm interested in, so I have just the data I need:
mydf2 = subset(mydf, state %in% c("NV","AL","OR","CO","TN","SC","MN","NJ","KY","CA"))
Here's an idea of what I'm looking for:
State Amount Count
NV 1 50
NV 2 35
NV 3 20
NV 4 15
AL 1 10
AL 2 6
AL 3 4
AL 4 1
...
For each state, I'm trying to find a count for each amount "level." I don't necessarily need to group the amount variable, but keep in mind that the amounts are not just 1, 2, 3, etc.:
> mydf$amount
[1] 300 264 152 75 750 574 113 152 750 152 675 489 188 263 152 152 600 167 34 925 375 156 675 152 488 204 152 152
[29] 600 489 488 75 152 152 489 222 563 215 452 152 152 75 100 113 152 150 152 150 152 452 150 152 152 225 600 620
[57] 113 152 150 152 152 152 152 152 152 152 640 236 152 480 152 152 200 152 560 152 240 222 152 152 120 257 152 400
Is there an elegant solution for this in R, or will I be stuck using Excel (yuck!)?
Here's my understanding of what you're trying to do:
Start with a simple data.frame with 26 states and amounts only ranging from 1 to 50 (which is much more restrictive than what you have in your example, where the range is much higher).
set.seed(1)
mydf <- data.frame(
state = sample(letters, 500, replace = TRUE),
amount = sample(1:50, 500, replace = TRUE)
)
head(mydf)
# state amount
# 1 g 28
# 2 j 35
# 3 o 33
# 4 x 34
# 5 f 24
# 6 x 49
Here's some straightforward tabulation. I've also removed any instances where frequency equals zero, and I've reordered the output by state.
temp1 <- data.frame(table(mydf$state, mydf$amount))
temp1 <- temp1[!temp1$Freq == 0, ]
head(temp1[order(temp1$Var1), ])
# Var1 Var2 Freq
# 79 a 4 1
# 157 a 7 2
# 391 a 16 1
# 417 a 17 1
# 521 a 21 1
# 1041 a 41 1
dim(temp1) # How many rows/cols
# [1] 410 3
Here's a little bit different tabulation. We are tabulating after grouping the "amount" values. Here, I've manually specified the breaks, but you could just as easily let R decide what it thinks is best.
temp2 <- data.frame(table(mydf$state,
cut(mydf$amount,
breaks = c(0, 12.5, 25, 37.5, 50),
include.lowest = TRUE)))
temp2 <- temp2[!temp2$Freq == 0, ]
head(temp2[order(temp2$Var1), ])
# Var1 Var2 Freq
# 1 a [0,12.5] 3
# 27 a (12.5,25] 3
# 79 a (37.5,50] 3
# 2 b [0,12.5] 2
# 28 b (12.5,25] 6
# 54 b (25,37.5] 5
dim(temp2)
# [1] 103 3
I am not sure if I understand correctly (you have two data.frames mydf and mydf2). I'll assume your data is in mydf. Using aggregate:
mydf$count <- 1:nrow(mydf)
aggregate(data = mydf, count ~ amount + state, length)
Is this what you are looking for?
Note: here count is a column created just so that the third column of the output comes out named count.
Alternatives with ddply from plyr:
# no need to create a variable called count
ddply(mydf, .(state, amount), summarise, count=length(lead_id))
Here one could use any column that exists in one's data instead of lead_id. Even state:
ddply(mydf, .(state, amount), summarise, count=length(state))
Or equivalently without using summarise:
ddply(mydf, .(state, amount), function(x) c(count=nrow(x)))
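For completeness, a sketch of the same tabulation with dplyr (not in the original answers): count() groups by state and amount and returns the frequency in a column named n.
library(dplyr)
mydf %>% count(state, amount)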
