I notice that I am doing the same thing multiple times, just with slightly different values:
HCCtreshold <- 40000
claimsMonthly[, HCC12mnth := +(HCCtreshold < claim12month)][ HCC12mnth == 1, `:=` (aboveHCCth12mnth = (claim12month - HCCtreshold))][is.na(aboveHCCth12mnth),aboveHCCth12mnth := 0]
claimsMonthly[, HCC11mnth := +(HCCtreshold < claim11month)][ HCC11mnth == 1, `:=` (aboveHCCth11mnth = (claim11month - HCCtreshold))][is.na(aboveHCCth11mnth),aboveHCCth11mnth := 0]
claimsMonthly[, HCC10mnth := +(HCCtreshold < claim10month)][ HCC10mnth == 1, `:=` (aboveHCCth10mnth = (claim10month - HCCtreshold))][is.na(aboveHCCth10mnth),aboveHCCth10mnth := 0]
So I started with something like this:
k <- seq.default(from = 8, to = 12, by = 1)
claimsMonthly[paste0("HCC", k, "mnth") := lapply(k, function(x) (+(HCCtreshold < paste0("HCC", k, "mnth"))))]
I get an error:
Error: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
I also tried:
for(k in 8:12){
  claimsMonthly[, paste0("HCC", k, "mnth") := +(HCCtreshold < paste0("HCC", k, "mnth"))]
}
The columns are created correctly, but I get incorrect values inside them: I get a 1 everywhere.
I am not sure what I am doing wrong.
I can offer some suggestions and, with some fake data, try them out.
You can programmatically define names on the left-hand side of := if you wrap a vector in c(...), so for instance DT[ c(vec_of_names) := list(some, values)].
You can programmatically retrieve the values of variables with a vector of variable names and mget. While I generally think mget can indicate problematic code, I believe that here it works with low risk. (While mget and get normally retrieve variables from the calling environment, often .GlobalEnv, from within a data.table operation they retrieve columns just as easily.)
Instead of a double-tap of assignment with == 1 and then is.na(...), we can use some logical trickery and the data.table::fcoalesce function. (If you aren't familiar, fcoalesce operates like SQL's coalesce function, which is a vector-friendly way of finding the first non-NA value across its vector arguments.)
fcoalesce(c(1, 2, NA, NA), c(11, 12, 13, NA), c(21, 22, 23, 24))
# [1] 1 2 13 24
We can use fcoalesce(some + math * calc, 0) to do the math and, if NA, replace it with 0. (We use it on the above* variables below, and not necessarily on the HCC* logical variables. It can apply there too, if desired. If those HCC* variables are throw-away, though, it just doesn't matter.)
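The three suggestions above can be sketched together on a tiny table. This is only an illustration: the table DT, its columns a and b, and the _x2 suffix are made up for the example and are not part of the question's data.

```r
library(data.table)

# Hypothetical toy table; names are for illustration only
DT <- data.table(a = 1:3, b = c(10, NA, 30))
src <- c("a", "b")            # columns to read
dst <- paste0(src, "_x2")     # columns to create

# c(dst) makes := treat dst as a vector of column names;
# mget(src) returns the named columns as a list;
# fcoalesce(..., 0) replaces any NA result with 0
DT[, c(dst) := lapply(mget(src), function(col) fcoalesce(col * 2, 0))]
DT
```

The NA in b becomes 0 in b_x2, while the other values are simply doubled.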
Fake data:
library(data.table)
set.seed(42)
hccthreshold <- 50
dat <- data.table( claim10month = sample(99, 10), claim11month = sample(99, 10), claim12month = sample(99, 10) )
dat$claim11month[5] <- NA
dat
# claim10month claim11month claim12month
# 1: 91 46 90
# 2: 92 71 14
# 3: 28 91 96
# 4: 80 25 91
# 5: 61 NA 8
# 6: 49 89 49
# 7: 69 97 37
# 8: 13 11 84
# 9: 60 95 41
# 10: 64 51 76
First, let's programmatically determine the column names we want to act on, and from them create the matching vectors of names for the new variables. (I'm a big fan of determining and adapting these variable names programmatically, so that if you get a partial data set your code still works. You might consider setting checks and alarms to catch something wrong. For instance, stopifnot(length(claimnames) == 12L), in case you are expecting to always have precisely 12 months.)
claimnames <- grep("^claim[0-9]+month", colnames(dat), value = TRUE)
hccnames <- gsub("^claim", "HCC", claimnames)
abovenames <- gsub("^claim", "aboveHCC", claimnames)
claimnames
# [1] "claim10month" "claim11month" "claim12month"
hccnames
# [1] "HCC10month" "HCC11month" "HCC12month"
abovenames
# [1] "aboveHCC10month" "aboveHCC11month" "aboveHCC12month"
And now, we can process the data.
dat[, c(hccnames) := lapply(mget(claimnames), `>`, hccthreshold) ]
# amount above the threshold; 0 when the claim is at or below it, or is NA
dat[, c(abovenames) := Map(function(hcc, clm) fcoalesce(hcc * (clm - hccthreshold), 0),
                           mget(hccnames), mget(claimnames)) ]
dat
# claim10month claim11month claim12month HCC10month HCC11month HCC12month aboveHCC10month aboveHCC11month aboveHCC12month
# 1: 91 46 90 TRUE FALSE TRUE 41 0 40
# 2: 92 71 14 TRUE TRUE FALSE 42 21 0
# 3: 28 91 96 FALSE TRUE TRUE 0 41 46
# 4: 80 25 91 TRUE FALSE TRUE 30 0 41
# 5: 61 NA 8 TRUE NA FALSE 11 0 0
# 6: 49 89 49 FALSE TRUE FALSE 0 39 0
# 7: 69 97 37 TRUE TRUE FALSE 19 47 0
# 8: 13 11 84 FALSE FALSE TRUE 0 0 34
# 9: 60 95 41 TRUE TRUE FALSE 10 45 0
# 10: 64 51 76 TRUE TRUE TRUE 14 1 26
I chose to keep the HCC* variables as logical instead of your +(...) integers, but it's directly translatable and up to you.
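As for why the for loop in the question gives a 1 everywhere: paste0(...) on the right-hand side is just a character string, so the threshold is compared with text rather than with a column, and that string comparison happens to be TRUE. Wrapping the name in get() retrieves the column instead. A minimal sketch, assuming a toy table (claims and its values are made up here) and assuming the comparison is meant to be against the claim* columns:

```r
library(data.table)

# Hypothetical toy data; column names follow the question's pattern
claims <- data.table(claim11month = c(10, 60000, NA),
                     claim12month = c(50000, 30000, 45000))
threshold <- 40000

# A number compared with a string is coerced to character first,
# so this is a text comparison and does not look at any column
threshold < "claim12month"

for (k in 11:12) {
  src <- paste0("claim", k, "month")
  new <- paste0("HCC", k, "mnth")
  # get(src) looks the column up by name inside the data.table
  claims[, (new) := +(threshold < get(src))]
}
claims
```

With get(), the new columns hold 1 only where the claim exceeds the threshold, and NA where the claim itself is NA.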
I have a vector with different values (positive and negative), and I want to select only the 10 lowest odd values and the 10 lowest even values. Help me, please!
This is a way to do it using base R.
# Vector with odd and even numbers
x <- sample(-100:100, 30)
The modulus operator (%%) in R helps get the job done. You can use it this way:
c(
  # Extract the 10 lowest even numbers
  head(sort(x[x %% 2 == 0]), 10),
  # Extract the 10 lowest odd numbers
  head(sort(x[x %% 2 == 1]), 10)
)
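One detail worth checking when negative values are involved: in R the result of %% takes the sign of the divisor, so negative odd numbers still give a remainder of 1 and the x %% 2 == 1 test does not miss them. A quick sketch with made-up values:

```r
x <- c(-3, -2, 3, 4)

# Remainders are 0 or 1 even for negative inputs
x %% 2
# [1] 1 0 1 0

# So the odd filter keeps -3 as well as 3
x[x %% 2 == 1]
# [1] -3  3
```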
Given v as your input vector, you can obtain the desired output (including positions) via the following code:
names(v) <- seq_along(v)
# lowest 10 odd numbers
low_odd <- sort(v[v%%2==1])[1:10]
# positions of those odd numbers in v
low_odd_pos <- as.numeric(names(low_odd))
# lowest 10 even numbers
low_even <- sort(v[v%%2==0])[1:10]
# positions of those even numbers in v
low_even_pos <- as.numeric(names(low_even))
Example
set.seed(1)
v <- sample(-50:50)
then
> low_odd
43 101 39 95 85 72 7 73 45 29
-49 -47 -45 -43 -41 -39 -37 -35 -33 -31
> low_odd_pos
[1] 43 101 39 95 85 72 7 73 45 29
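If you would rather not attach names to v, the positions can also be recovered with which() and order(). This is an alternative sketch rather than the approach above; odd_pos is a helper name introduced here.

```r
set.seed(1)
v <- sample(-50:50)

# Positions of all odd values in v
odd_pos <- which(v %% 2 == 1)

# Reorder those positions by the values they point to, keep the first 10
low_odd_pos <- head(odd_pos[order(v[odd_pos])], 10)

# The values themselves, lowest first
v[low_odd_pos]
```

Using head() instead of [1:10] avoids NA padding when fewer than 10 odd values exist.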
My friend gave me a brain teaser that I wanted to try in R.
Imagine 100 coins in a row, with heads facing up for all coins. Now every 2nd coin is flipped (thus becoming tails). Then every 3rd coin is flipped. How many coins are now showing heads?
To create the vector, I started with:
flips <- rep('h', 100)
levels(flips) <- c("h", "t")
Not sure how to proceed from here. Any help would be appreciated.
Try this:
coins <- rep(1, 100) # 1 = Head, 0 = Tail
n = 3 # run till the time when you flip every 3rd coin
invisible(sapply(2:n, function(i) {indices <- seq(i, 100, i); coins[indices] <<- (coins[indices] + 1) %% 2}) )
which(coins == 1)
# [1] 1 5 6 7 11 12 13 17 18 19 23 24 25 29 30 31 35 36 37 41 42 43 47 48 49 53 54 55 59 60 61 65 66 67 71 72 73 77 78 79 83 84 85 89 90 91 95 96 97
sum(coins==1)
#[1] 49
If you run till n = 100, only the coins at the positions which are perfect squares will be showing heads.
coins <- rep(1, 100) # 1 = Head, 0 = Tail
n <- 100
invisible(sapply(2:n, function(i) {indices <- seq(i, 100, i); coins[indices] <<- (coins[indices] + 1) %% 2}) )
which(coins == 1)
# [1] 1 4 9 16 25 36 49 64 81 100
sum(coins==1)
# [1] 10
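The same result can be reached without modifying coins from inside sapply. A coin at position p is flipped once for every i in 2:n that divides p, so it shows heads exactly when that count is even. A sketch of this alternative (heads_after is a helper name made up for the example):

```r
positions <- 1:100

heads_after <- function(n) {
  # divides[p, j] is TRUE when the j-th step size (2, 3, ..., n) divides p
  divides <- outer(positions, 2:n, function(p, i) p %% i == 0)
  # an even number of flips leaves the coin showing heads
  which(rowSums(divides) %% 2 == 0)
}

length(heads_after(3))
heads_after(100)
```

This builds a 100-by-(n-1) logical matrix, which is fine at this size but would not scale to very long rows of coins.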
I'm interested in developing a modified bootstrap that samples some vector of length x, with replacement, but must meet a number of criteria before it stops sampling. I'm attempting to calculate confidence intervals for lambda, a population's growth rate, using 10000 iterations, but in some groupings of individuals, say vector 13, there are very few individuals growing out of the group. Typical bootstrapping would lead to a fair number of instances where growth in this vector does not occur, and hence the model falls apart. Each vector consists of a certain number of 1's, 2's, and 3's, where 1 represents staying within a group, 2 growing out of a group, and 3 death. Here is what I have so far without the modification; it is likely not the best approach time-wise, but I am new to R.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagestay <- vector()
stagemoved <- vector()
stagedead <- vector()
for(i in 1:n){
  index <- sample(stage, replace=T)
  stay <- ((length(index[index==1]))/(length(index)))
  moved <- ((length(index[index==2]))/(length(index)))
  stagestay <- rbind(stagestay,stay)
  stagemoved <- rbind(stagemoved,moved)
}
Currently, this samples
My question is then: In what way can I modify the sample function to continue sampling these numbers until the length of "index" is at least the same as st13 AND until at least 1 instance of a 2 is present in "index"?
Thanks very much,
Kristopher Hennig
Masters Student
University of Mississippi
Oxford, MS, 38677
Update:
The answer from @lselzer reminded me that the requirement was for the length of the sample to be at least as long as st13. My code above just keeps sampling until it finds a bootstrap sample that contains a 2. The code of @lselzer grows the sample, 1 new index at a time, until the sample contains a 2. This is quite inefficient, as you might have to call sample() many times until you get a 2. My code might repeat a long time before a 2 is returned in the sample. So can we do any better?
One way would be to draw a large sample with replacement using a single call to sample(). Check which entries are 2s and see if there is a 2 within the first length(st13) entries. If there is, return those entries; if not, find the first 2 in the large sample and return all entries up to and including that one. If there are no 2s, add on another large sample and repeat. Here is some code:
#runs
n <- 100 #00
stage <- st13
stagedead <- stagemoved <- stagestay <- Size <- vector()
sampSize <- 100 * (len <- length(stage)) ## sample size to try
for(i in seq_len(n)){
  ## take a large sample
  samp <- sample(stage, size = sampSize, replace = TRUE)
  ## check if there are any `2`s and which they are
  ## and if no 2s expand the sample
  while(length((twos <- which(samp == 2))) < 1) {
    samp <- c(samp, sample(stage, size = sampSize, replace = TRUE))
  }
  ## now we have a sample containing at least one 2
  ## so set index to the required set of elements
  if((min.two <- min(twos)) <= len) {
    index <- samp[seq_len(len)]
  } else {
    index <- samp[seq_len(min.two)]
  }
  stay <- length(index[index==1]) / length(index)
  moved <- length(index[index==2]) / length(index)
  stagestay[i] <- stay
  stagemoved[i] <- moved
  Size[i] <- length(index)
}
Here is a really degenerate vector with only a single 2 in 46 entries:
R> st14 <- sample(c(rep(1, 45), 2))
R> st14
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1
If I use the above loop on it rather than st13, I get the following for the minimum sample size required to get a 2 on each of the 100 runs:
R> Size
[1] 65 46 46 46 75 46 46 57 46 106 46 46 46 66 46 46 46 46
[19] 46 46 46 46 46 279 52 46 63 70 46 46 90 107 46 46 46 87
[37] 130 46 46 46 46 46 46 60 46 167 46 46 46 71 77 46 46 84
[55] 58 90 112 52 46 53 85 46 59 302 108 46 46 46 46 46 174 46
[73] 165 103 46 110 46 80 46 166 46 46 46 65 46 46 46 286 71 46
[91] 131 61 46 46 141 46 46 53 47 83
So it would suggest that the sampSize I chose (100 * length(stage)) is a bit of overkill here, but as all the operators we are using are vectorised we probably don't incur too much of a penalty for the overly long initial sample size, and we certainly don't incur any extra sample() calls.
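To get a feel for how often a plain bootstrap sample misses the 2s entirely, the probability can be worked out directly: each draw avoids a 2 with probability 1 - (number of 2s)/(vector length), and the draws are independent. A small sketch (p_no_two is a helper name introduced here), using 3 twos in 46 entries for st13 and 1 two in 46 entries for st14:

```r
# Probability that `size` draws with replacement contain no 2 at all
p_no_two <- function(n_two, n_total, size = n_total) {
  (1 - n_two / n_total)^size
}

p_no_two(3, 46)   # st13: roughly 4 to 5 percent of samples have no 2
p_no_two(1, 46)   # st14: a bit over a third of samples have no 2
```

So for st13 the extra sampling is only needed occasionally, while for the degenerate st14 it kicks in for a sizeable share of the runs.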
Original:
If I understand you correctly, the problem is that sample() might not return any 2 indices at all. If so, we can continue sampling until it does, using the repeat control flow construct.
I've altered your code accordingly, and optimised it a bit, because you should never grow objects in a loop the way you were doing with rbind(). There are other ways this could be improved, but I'll stick with the loop for now. Explanation comes below.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagedead <- stagemoved <- stagestay <- vector()
for(i in seq_len(n)){
  repeat {
    index <- sample(stage, replace = TRUE)
    if(any(index == 2)) {
      break
    }
  }
  stay <- length(index[index==1]) / length(index)
  moved <- length(index[index==2]) / length(index)
  stagestay[i] <- stay
  stagemoved[i] <- moved
}
This is the main change related to your Q:
repeat {
  index <- sample(stage, replace = TRUE)
  if(any(index == 2)) {
    break
  }
}
What this does is repeat the code contained in the braces until a break is triggered to jump us out of the repeat loop. So what happens is we take a bootstrap sample, then check if any element of the sample is a 2. If there are any 2s then we break out and carry on with the rest of the current for loop iteration. If the sample doesn't contain any 2s, the break is not triggered and we go round again, taking another sample. This will happen until we do get a sample with a 2 in it.
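One of the further improvements hinted at is to preallocate the result vectors to their final length, so that nothing grows inside the loop, and to use mean() on a logical vector to get the proportions. A sketch under a few assumptions: stage is a stand-in built with rep() to have 41 ones, 3 twos and 2 threes, n is reduced to 1000, and the seed is arbitrary.

```r
# Stand-in for st13: 41 ones, 3 twos, 2 threes (46 entries)
stage <- c(rep(1, 41), rep(2, 3), rep(3, 2))
n <- 1000          # reduced number of runs for the sketch
set.seed(1)        # arbitrary seed, only for reproducibility

# Preallocate the results instead of growing them
stagestay  <- numeric(n)
stagemoved <- numeric(n)

for (i in seq_len(n)) {
  repeat {
    index <- sample(stage, replace = TRUE)
    if (any(index == 2)) break   # keep only samples containing a 2
  }
  # mean() of a logical vector is the proportion of TRUEs
  stagestay[i]  <- mean(index == 1)
  stagemoved[i] <- mean(index == 2)
}

# Percentile interval for the proportion that moved
quantile(stagemoved, c(0.025, 0.975))
```

Note that rejecting samples without a 2 conditions the bootstrap on growth having occurred, so the resulting interval describes that conditional distribution.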
For starters, sample has a size argument which you could use to match the length of st13. The second part of your question could be solved using a while loop.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagestay <- vector()
stagemoved <- vector()
stagedead <- vector()
for(i in 1:n){
  index <- sample(stage, length(stage), replace=T)
  while(!any(index == 2)) {
    index <- c(index, sample(stage, 1, replace = T))
  }
  stay <- ((length(index[index==1]))/(length(index)))
  moved <- ((length(index[index==2]))/(length(index)))
  stagestay[i] <- stay
  stagemoved[i] <- moved
}
While I was writing this, Gavin posted his answer, which is similar to mine, but I added the size argument to be sure index has at least the length of st13.