Reducing repetitive tasks in data.table in R

I notice that I am doing the same thing multiple times, just with slightly different values:
HCCtreshold <- 40000
claimsMonthly[, HCC12mnth := +(HCCtreshold < claim12month)][ HCC12mnth == 1, `:=` (aboveHCCth12mnth = (claim12month - HCCtreshold))][is.na(aboveHCCth12mnth),aboveHCCth12mnth := 0]
claimsMonthly[, HCC11mnth := +(HCCtreshold < claim11month)][ HCC11mnth == 1, `:=` (aboveHCCth11mnth = (claim11month - HCCtreshold))][is.na(aboveHCCth11mnth),aboveHCCth11mnth := 0]
claimsMonthly[, HCC10mnth := +(HCCtreshold < claim10month)][ HCC10mnth == 1, `:=` (aboveHCCth10mnth = (claim10month - HCCtreshold))][is.na(aboveHCCth10mnth),aboveHCCth10mnth := 0]
So I started with something like this:
k <- seq.default(from = 8, to = 12, by = 1)
claimsMonthly[paste0("HCC", k, "mnth") := lapply(k, function(x) (+(HCCtreshold < paste0("HCC", k, "mnth"))))]
I get an error:
Error: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
I also tried:
for(k in 8:12){
claimsMonthly[, paste0("HCC", k, "mnth") := +(HCCtreshold < paste0("HCC", k, "mnth"))]
}
The columns are created correctly, but the values inside them are wrong: I get a 1 everywhere.
I am not sure what I am doing wrong.

I can offer some suggestions and, with some fake data, try them out.
You can programmatically define names on the left-hand side of := if you wrap a vector in c(...), so for instance DT[ c(vec_of_names) := list(some, values)].
You can programmatically retrieve values of variables with a vector of variable names and mget. While I generally think mget can indicate problematic code, I believe that here it works with low risk. (While mget and get normally retrieve variables from the calling environment, often .GlobalEnv, from within a data.table operation they retrieve columns just as easily.)
Instead of a double-tap of assignment with == 1 and then is.na(...), we can use some logical trickery and the data.table::fcoalesce function. (If you aren't familiar, fcoalesce operates like SQL's coalesce function, which is a vector-friendly way of finding the first non-NA value in arguments of vectors.)
fcoalesce(c(1, 2, NA, NA), c(11, 12, 13, NA), c(21, 22, 23, 24))
# [1] 1 2 13 24
We can use fcoalesce(some + math * calc, 0) to do the math and, if NA, replace it with 0. (We use it on the above* variables below, and not necessarily on the HCC* logical variables. It can apply there too, if desired. If those HCC* variables are throw-away, though, it just doesn't matter.)
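As a side note on why the asker's for loop filled every column with 1: inside `j`, `paste0("HCC", k, "mnth")` is only a character string, so the threshold is compared against text rather than against a column. A minimal sketch of the loop with the column looked up via `get()` follows; the toy table `dt`, the `claim8month`/`claim9month` columns, and the `threshold` value are made up for illustration.

```r
library(data.table)

# Hypothetical toy data; column names mimic the asker's naming scheme
dt <- data.table(claim8month = c(10, 60), claim9month = c(70, 20))
threshold <- 50

for (k in 8:9) {
  src <- paste0("claim", k, "month")  # existing column to read
  dst <- paste0("HCC", k, "mnth")     # new column to create
  # (dst) tells := to use the value of dst as the column name;
  # get(src) fetches the column itself instead of comparing against a string
  dt[, (dst) := +(threshold < get(src))]
}
dt
```

The vectorised `mget`/`lapply` form shown later in this answer does the same thing without an explicit loop.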
Fake data:
library(data.table)
set.seed(42)
hccthreshold <- 50
dat <- data.table( claim10month = sample(99, 10), claim11month = sample(99, 10), claim12month = sample(99, 10) )
dat$claim11month[5] <- NA
dat
# claim10month claim11month claim12month
# 1: 91 46 90
# 2: 92 71 14
# 3: 28 91 96
# 4: 80 25 91
# 5: 61 NA 8
# 6: 49 89 49
# 7: 69 97 37
# 8: 13 11 84
# 9: 60 95 41
# 10: 64 51 76
First, let's programmatically determine the column names we want to act on, and from them create the same vectors for the new variables. (I'm a big fan of determining and adapting these variable names programmatically, so that if you get a partial data set your code still works. You might consider setting checks and alarms to catch something wrong. For instance, stopifnot(length(claimnames) == 12L), in case you are expecting to always have precisely 12 months.)
claimnames <- grep("^claim[0-9]+month", colnames(dat), value = TRUE)
hccnames <- gsub("^claim", "HCC", claimnames)
abovenames <- gsub("^claim", "aboveHCC", claimnames)
claimnames
# [1] "claim10month" "claim11month" "claim12month"
hccnames
# [1] "HCC10month" "HCC11month" "HCC12month"
abovenames
# [1] "aboveHCC10month" "aboveHCC11month" "aboveHCC12month"
And now, we can process the data.
dat[, c(hccnames) := lapply(mget(claimnames), `>`, hccthreshold) ]
# multiplying by the logical hcc zeroes out rows at or below the threshold;
# fcoalesce then turns any remaining NA into 0
dat[, c(abovenames) := Map(function(hcc, clm) fcoalesce(hcc * (clm - hccthreshold), 0),
mget(hccnames), mget(claimnames)) ]
dat
# claim10month claim11month claim12month HCC10month HCC11month HCC12month aboveHCC10month aboveHCC11month aboveHCC12month
# 1: 91 46 90 TRUE FALSE TRUE 41 0 40
# 2: 92 71 14 TRUE TRUE FALSE 42 21 0
# 3: 28 91 96 FALSE TRUE TRUE 0 41 46
# 4: 80 25 91 TRUE FALSE TRUE 30 0 41
# 5: 61 NA 8 TRUE NA FALSE 11 0 0
# 6: 49 89 49 FALSE TRUE FALSE 0 39 0
# 7: 69 97 37 TRUE TRUE FALSE 19 47 0
# 8: 13 11 84 FALSE FALSE TRUE 0 0 34
# 9: 60 95 41 TRUE TRUE FALSE 10 45 0
# 10: 64 51 76 TRUE TRUE TRUE 14 1 26
I chose to keep the HCC* variables as logical instead of your +(...) integers, but it's directly translatable and up to you.
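If you would rather avoid mget altogether, one possible variant uses `.SDcols`, which hands the selected columns to `j` as `.SD`. This is only a sketch under assumptions: the small table `dat2` is invented for the example, the HCC flags are kept as integers as in the question, and amounts at or below the threshold (or NA) are counted as 0, following the asker's original rule.

```r
library(data.table)

# Hypothetical toy data with one missing claim
dat2 <- data.table(claim10month = c(91L, 28L, NA), claim11month = c(46L, 71L, 25L))
hccthreshold <- 50

claimnames <- grep("^claim[0-9]+month$", names(dat2), value = TRUE)
hccnames   <- sub("^claim", "HCC", claimnames)
abovenames <- sub("^claim", "aboveHCC", claimnames)

# 0/1 flag per claim column (NA stays NA)
dat2[, (hccnames) := lapply(.SD, function(clm) as.integer(clm > hccthreshold)),
     .SDcols = claimnames]
# amount above the threshold, floored at 0, with NA replaced by 0
dat2[, (abovenames) := lapply(.SD, function(clm) fcoalesce(pmax(clm - hccthreshold, 0), 0)),
     .SDcols = claimnames]
dat2
```

Here `pmax(..., 0)` replaces the multiply-by-logical trick, so the second step does not depend on the HCC columns existing.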

Related

Modifying for loop with if conditions to apply format in R

I am creating a variable called indexPoints that contains a subset of index values that passed certain conditions -
set.seed(1)
x = abs(rnorm(100,1))
y = abs(rnorm(100,1))
threshFC = 0.5
indexPoints=c()
seqVec = seq(1, length(x))
for (i in seq_along(seqVec)){
fract = x[i]/y[I]
fract[1] = NaN
if (!is.nan(fract)){
if(fract > (threshFC + 1) || fract < (1/(threshFC+1))){
indexPoints = c(indexPoints, i)
}
}
}
I am trying to recreate indexPoints using a more efficient method like apply methods (any except sapply). I started the process as shown below -
set.seed(1)
x = abs(rnorm(100,1))
y = abs(rnorm(100,1))
threshFC = 0.5
seqVec <- seq_along(x)
fract = x[seqVec]/y[seqVec]
fract[1] = NaN
vapply(fract, function(i){
if (!is.nan(fract)){ if(fract > (threshFC + 1) || fract < (1/(threshFC+1))){ i}}
}, character(1))
However, this attempt causes an ERROR:
Error in vapply(fract, function(i) { : values must be length 1,
but FUN(X[[1]]) result is length 0
How can I modify the code to put it in an apply format? Note: sometimes the fract variable contains NaN values, which I mimicked in the minimal examples above by using "fract[1] = NaN".
There are several problems with your code:
You tell vapply that you expect the internal code to return a character, yet the only thing you ever return is i which is numeric;
You only explicitly return something when all conditions are met, which means if the conditions are not all good, you do not return anything ... this is the same as return(NULL) which is also not character (try vapply(1:2, function(a) return(NULL), character(1)));
You explicitly set fract[1] = NaN and then test !is.nan(fract), so you will never get anything; and
(Likely a typo) You reference y[I] (capital "i") which is an error unless I is defined somewhere (which is no longer a syntax error but is now a logical error).
If I fix the code (remove NaN assignment) in your for loop, I get
indexPoints
# [1] 3 4 5 6 10 11 12 13 14 15 16 18 20 21 25 26 28 29 30 31 32 34 35 38 39
# [26] 40 42 43 44 45 47 48 49 50 52 53 54 55 56 57 58 59 60 61 64 66 68 70 71 72
# [51] 74 75 77 78 79 80 81 82 83 86 88 89 90 91 92 93 95 96 97 98 99
If we really want to do this one at a time (I recommend against it, read below), then there are a few methods:
Use Filter to only return the indices where the condition is true:
indexPoints2 <- Filter(function(i) {
fract <- x[i] / y[i]
!is.nan(fract) && (fract > (threshFC+1) | fract < (1/(threshFC+1)))
}, seq_along(seqVec))
identical(indexPoints, indexPoints2)
# [1] TRUE
Use vapply correctly, returning an integer either way:
indexPoints3 <- vapply(seq_along(seqVec), function(i) {
fract <- x[i] / y[i]
if (!is.nan(fract) && (fract > (threshFC+1) | fract < (1/(threshFC+1)))) i else NA_integer_
}, integer(1))
str(indexPoints3)
# int [1:100] NA NA 3 4 5 6 NA NA NA 10 ...
indexPoints3 <- indexPoints3[!is.na(indexPoints3)]
identical(indexPoints, indexPoints3)
# [1] TRUE
(Notice the explicit return of a specific type of NA, that is NA_integer_, so that vapply is happy.)
We can instead just return the logical if the index matches the conditions:
logicalPoints4 <- vapply(seq_along(seqVec), function(i) {
fract <- x[i] / y[i]
!is.nan(fract) && (fract > (threshFC+1) | fract < (1/(threshFC+1)))
}, logical(1))
head(logicalPoints4)
# [1] FALSE FALSE TRUE TRUE TRUE TRUE
identical(indexPoints, which(logicalPoints4))
# [1] TRUE
But really, there is absolutely no need to use vapply or any of the apply functions, since this can be easily (and much more efficiently) checked as a vector:
fract <- x/y # all at once
indexPoints5 <- which(!is.nan(fract) & (fract > (threshFC+1) | fract < (1/(threshFC+1))))
identical(indexPoints, indexPoints5)
# [1] TRUE
(If you don't use which, you'll see that it gives you a logical vector indicating if the conditions are met, similar to bullet 3 above with logicalPoints4.)

Multiple different conditions and if statments within a loop

I want to assign different letters from A:U to a new column vector according to some conditions that depend on a different column that takes the numbers 1:99.
I came up with the following solution, but I want to write it more efficiently.
for (i in 1:99){
if (i %in% 1:3 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"A"
}
.............
if (i %in% 45:60 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"D"
}
.....................
if (i == 99 ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"U"
}
}
In the previous code I skipped multiple other lines which essentially do the same thing. Notice that the conditions change throughout the loop and are of two types: one is of the type i %in% 45:60 == T and the other of the type i == 99.
My original code has multiple such ifs within this loop so any help on how I can write it more efficiently or compactly will be appreciated.
The user has requested to map the numbers given in H07_NACE$NACE2.Code to the letters "A" to "U" according to given rules he has hardcoded in a number of if clauses.
A more flexible approach (and less tedious to code) is to use a lookup table (or constraint vector as Joseph Wood called it in his answer).
With data.table, we can use either a rolling join or a non-equi update join to do the mapping.
Sample data to be mapped
set.seed(1)
H07_NACE <- data.frame(NACE2.Code = sample(99, 10, replace = TRUE))
Rolling join
For the rolling join, we specify the mapping rules by tiling the number range 1:99 contiguously and giving the start number of each tile.
library(data.table)
# set up lookup table
lookup <- data.table(Code = c(1, 4, 21, 45, 61:75, 98, 99),
Sector = LETTERS[1:21])
lookup
Code Sector
1: 1 A
2: 4 B
3: 21 C
4: 45 D
5: 61 E
6: 62 F
7: 63 G
8: 64 H
9: 65 I
10: 66 J
11: 67 K
12: 68 L
13: 69 M
14: 70 N
15: 71 O
16: 72 P
17: 73 Q
18: 74 R
19: 75 S
20: 98 T
21: 99 U
Code Sector
# map Code to Sector
lookup[setDT(H07_NACE), on = .(Code = NACE2.Code), roll = TRUE]
Code Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
If the H07_NACE is to be updated we can append a new column by
setDT(H07_NACE)[, NACE2.Sector := lookup[H07_NACE, on = .(Code = NACE2.Code),
roll = TRUE, Sector]][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Non-equi update join
For the non-equi update join, we specify the mapping rules by giving the lower and upper bounds. This can be derived from lookup by
lookup2 <- lookup[, .(Sector, lower = Code,
upper = shift(Code - 1L, type = "lead", fill = max(Code)))]
lookup2
Sector lower upper
1: A 1 3
2: B 4 20
3: C 21 44
4: D 45 60
5: E 61 61
6: F 62 62
7: G 63 63
8: H 64 64
9: I 65 65
10: J 66 66
11: K 67 67
12: L 68 68
13: M 69 69
14: N 70 70
15: O 71 71
16: P 72 72
17: Q 73 73
18: R 74 74
19: S 75 97
20: T 98 98
21: U 99 99
Sector lower upper
The new column is created by
setDT(H07_NACE)[lookup2, on = .(NACE2.Code >= lower, NACE2.Code <= upper),
NACE2.Sector := Sector][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
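For comparison, the same tile-start vector used in the rolling join can also drive base R's findInterval, which returns, for each code, how many start values are less than or equal to it. This is a sketch only; the variable names `starts`, `codes`, and `sectors` are introduced here for illustration, and the codes are the ten sample values shown in the output above.

```r
# Start of each tile, identical to lookup$Code above
starts <- c(1, 4, 21, 45, 61:75, 98, 99)

# The ten sample codes from the printed output above
codes <- c(27, 37, 57, 90, 20, 89, 94, 66, 63, 7)

# findInterval() gives the index of the tile each code falls into
sectors <- LETTERS[findInterval(codes, starts)]
sectors
# [1] "C" "C" "D" "S" "B" "S" "S" "J" "G" "B"
```

This assumes every code is at least 1; a code below the first start value would give index 0 and silently drop out of the `LETTERS[...]` lookup.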
Here is a quick and dirty solution that should do the job (I'm sure there is more efficient/elegant way to do this). We can setup a constraint vector and use indexing from there to produce the desired results.
## Here is some random data that resembles the OP's
set.seed(3)
H07_NACE <- data.frame(NACE2.Code = sample(99, replace = TRUE))
## "T" is the 20th element... we need to guarantee
## that the number corresponding to "U"
## corresponds to max(NACE2.Code)
maxCode <- max(H07_NACE$NACE2.Code)
constraintVec <- sort(sample(maxCode - 1, 20))
constraintVec <- c(constraintVec, maxCode)
H07_NACE$NACE2.Sector <- LETTERS[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
## Add optional check column to ensure we are mapping the
## Code to the correct Sector
H07_NACE$NACE2.Check <- constraintVec[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check
1 17 E 18
2 80 R 85
3 39 K 54
4 33 J 37
5 60 N 66
6 60 N 66
Update courtesy of @Frank
As suspected, there is a much simpler solution assuming the above logic is correct. We use findInterval and set the arguments rightmost.closed and left.open to TRUE (we also have to add 1L to the resulting vector):
H07_NACE$NACE2.Sector2 <- LETTERS[findInterval(H07_NACE$NACE2.Code, constraintVec,
rightmost.closed = TRUE, left.open = TRUE) + 1L]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check NACE2.Sector2
1 17 E 18 E
2 80 R 85 R
3 39 K 54 K
4 33 J 37 J
5 60 N 66 N
6 60 N 66 N
identical(H07_NACE$NACE2.Sector, H07_NACE$NACE2.Sector2)
[1] TRUE
Here are two tidyverse examples, though I'm not completely certain what the original poster is really asking for.
library(tidyverse)
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = ifelse(NACE2.Code %in% 1:3, "A",
ifelse(NACE2.Code %in% 45:60, "D",
ifelse(NACE2.Code ==99, "U", NA))))
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = case_when(NACE2.Code %in% 1:3 ~ "A",
NACE2.Code %in% 45:60 ~ "D",
NACE2.Code ==99 ~ "U")) %>%
drop_na

Integers that are not divisible by several numbers

I am trying to print a vector with the integers between 1 and 100 that are not divisible by 2, 3 and 7 in R.
I tried seq but I am not sure how to continue.
Another option is to use Filter to, well, filter the sequence for any number that meets your condition:
Filter(function(i) { all(i %% c(2,3,7) != 0) }, seq(100))
## [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
Note that while this may (IMO) be the most readable, it's the worst in terms of performance (so far):
UPDATED to take into account rawr's for loop solution:
library(microbenchmark)
microbenchmark(
filter={ v1 <- seq(100); Filter(function(i) { all(i %% c(2,3,7) != 0) }, v1) },
reduce={ v1 <- seq(100); v1[!Reduce(`|`,lapply(c(2,3,7), function(x) !(v1 %%x)))] },
rowout={ v1 <- seq(100); v1[rowSums(outer(v1, c(2, 3, 7), "%%") == 0) == 0] },
looopy={ v1 <- seq(100); for (ii in c(2,3,7)) v1 <- v1[-which(v1 %% ii == 0)]; v1 },
times=1000
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## filter 108.280 118.7000 143.88592 126.2155 136.6290 2349.952 1000 c
## reduce 21.552 23.8095 25.91997 24.8150 25.8580 144.067 1000 ab
## rowout 26.075 28.4920 31.11812 29.5350 31.2125 184.225 1000 b
## looopy 14.149 16.0765 18.11806 16.8995 17.8595 160.485 1000 a
To make it fair I added sequence generation to all of them (and, I was doing this to compare relative performance vs actual speed anyway, so the comparison results still work).
Original statement:
"Unsurprisingly, akrun's is optimal :-)"
is now superseded by:
"Unsurprisingly, rawr's is optimal :-)"
Basically you want to compute each of the numbers in 1:100 modulo 2, 3, and 7. You could use outer to perform all the modulo operations in a single vectorized operation, using rowSums to identify the elements in 1:100 that are not perfectly divided by 2, 3, or 7.
v1 <- 1:100
v1[rowSums(outer(v1, c(2, 3, 7), "%%") == 0) == 0]
# [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
We can loop over the divisors with lapply and the modulo operator, convert each 0 to TRUE by negating (!), use Reduce with | to flag the elements that are divisible by any of the divisors, then negate and subset 'v1'.
v1[!Reduce(`|`,lapply(c(2,3,7), function(x) !(v1 %%x)))]
Or instead of looping, this can be also done in a faster way.
v1[!(!v1%%2) + (!v1%%3) + (!v1%%7)]
data
v1 <- seq(100)
The other answers are better, but if you really need to use a for loop, as this question suggests, this could be a possibility:
x <- vector()
n <- 1L
for(i in 1:100){if (i%%2!=0 & i%%3!=0 & i%%7!=0) {x[n] <- i; n <- n+1}}
#> x
# [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
As already mentioned, the other answers posted here are better because they exploit the vectorized capabilities of R. The short code shown here is probably slower than any of the other answers and more complicated to maintain. It is the typical syntax of other programming languages, like C or FORTRAN, applied to R. It works, but it is not the way things should be done.
Rather than using modulo arithmetic explicitly, we can generate the negative modulo sequence easily by counting down. Then for each of the three sequences, we can OR them all together, then drop it into which().
which(as.logical(pmin(rep_len(1:0, 100),
rep_len(2:0, 100),
rep_len(6:0, 100))))
If we want to be a bit less hardcoded, we might use do.call with lapply():
which(as.logical(do.call(pmin, lapply(c(2,3,7)-1, function(x)rep_len(x:0, 100)))))
EDIT:
Here's one way to do it using logicals:
v1 <- logical(100); for (ii in c(2,3,7) -1) v1 <- v1 | rep_len(rep(c(F,T), c(ii,1)), 100) ; which(!v1)
I had the same problem in my class. I assumed the teacher had given me all the information I needed to find the answer, and I was correct. This is week one, and all that other stuff you advanced people used has not come up yet.
I did this, though:
r = c(1:100)
which(r %% 2 != 0 & r %% 3 != 0 & r %% 7 != 0)
Use the which function.

Filter rows based on values of multiple columns in R

Here is the data set, say name is DS.
Abc Def Ghi
1 41 190 67
2 36 118 72
3 12 149 74
4 18 313 62
5 NA NA 56
6 28 NA 66
7 23 299 65
8 19 99 59
9 8 19 61
10 NA 194 69
How do I get a new dataset DSS where the value of column Abc is greater than 25 and the value of column Def is greater than 100? It should also ignore any row if the value of at least one column is NA.
I have tried few options but wasn't successful. Your help is appreciated.
There are multiple ways of doing it. I have given 5 methods, and the first 4 methods are faster than the subset function.
R Code:
# Method 1:
DS_Filtered <- na.omit(DS[(DS$Abc > 25 & DS$Def > 100), ])
# Method 2: which() also drops rows where the condition is NA
DS_Filtered <- DS[which(DS$Abc > 25 & DS$Def > 100), ]
# Method 3:
DS_Filtered <- na.omit(DS[(DS$Abc > 25) & (DS$Def > 100), ])
# Method 4: using the dplyr package (needs library(dplyr))
DS_Filtered <- filter(DS, Abc > 25, Def > 100)
DS_Filtered <- DS %>% filter(Abc > 25 & Def > 100)
# Method 5: subset() by default drops rows where the condition is NA
DS_Filtered <- subset(DS, Abc > 25 & Def > 100)
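To try the filtering end to end, here is a small self-contained sketch that rebuilds the question's table and uses `complete.cases()` so that a row is dropped when any column is NA, not only Abc or Def. The 25 and 100 cut-offs come from the question; the helper name `keep` is introduced here for illustration.

```r
# Rebuild the data set from the question
DS <- data.frame(
  Abc = c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA),
  Def = c(190, 118, 149, 313, NA, NA, 299, 99, 19, 194),
  Ghi = c(67, 72, 74, 62, 56, 66, 65, 59, 61, 69)
)

# complete.cases() is FALSE for any row containing an NA,
# and FALSE & NA is FALSE, so keep never contains NA
keep <- complete.cases(DS) & DS$Abc > 25 & DS$Def > 100
DSS <- DS[keep, ]
DSS
#   Abc Def Ghi
# 1  41 190  67
# 2  36 118  72
```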

replacing specific elements of a vector

I am trying to write the user-defined function below in R:
wrkexpcode.into.month <- function(vec) {
tmp.vec <- vec
tmp.vec[tmp.vec == 0 | tmp.vec == 9] <- NA
tmp.vec[tmp.vec == 1] <- 4
tmp.vec[tmp.vec == 2] <- 13
tmp.vec[tmp.vec == 3] <- 31
tmp.vec[tmp.vec == 4] <- 78
tmp.vec[tmp.vec == 5] <- 174
tmp.vec[tmp.vec == 6] <- 240
return (tmp.vec)
}
but when I execute with a simple command like
wrkexpcode.into.month(c(3,2,2,3,1,3,5,6,4))
the result comes like
[1] 31 13 13 31 78 31 174 240 78
but I expect the result like
[1] 31 13 13 31 **4** 31 174 240 78
How can I fix this?
You have to carefully follow the flow of your function, evaluating what the values are at each step. You are expecting 1 to be replaced by 4 based on tmp.vec[tmp.vec == 1] <- 4; however, in tmp.vec[tmp.vec == 4] <- 78 later down the road, that 4 is replaced by 78. This happens because you replace the values in tmp.vec and also use tmp.vec to determine what needs to be replaced. Like @MattewPlourde said, you need to base the replacement on vec:
tmp.vec[vec == 1] <- 4
Although I would simply replace the code by:
wrkexpcode.into.month <- function(vec) {
translation_vector = c('0' = NA, '1' = 4, '2' = 13, '3' = 31,
'4' = 78, '5' = 174, '6' = 240, '9' = NA)
return(translation_vector[as.character(vec)])
}
wrkexpcode.into.month(c(3,2,2,3,1,3,5,6,4))
# 3 2 2 3 1 3 5 6 4
# 31 13 13 31 4 31 174 240 78
See also a blogpost I wrote recently about this kind of operation.
I think it will be much easier to use one of the many recode functions that are designed for such purposes instead of hard-coding it. It's just a one-liner then, e.g.
library(likert)
x <- c(3,2,2,3,1,3,5,6,4)
recode(x, from=c(0:6, 9), to=c(NA, 4,13,31,78,174,240,NA))
[1] 31 13 13 31 4 31 174 240 78
And if desired, wrap it into a function, e.g.
wrkexpcode.into.month <- function(x)
recode(x, from=c(0:6, 9), to=c(NA, 4,13,31,78,174,240,NA))
wrkexpcode.into.month(x)
[1] 31 13 13 31 4 31 174 240 78
You could create matrix pointing the input value (column1) to the desired output value (column2)
table=matrix(c(0,1,2,3,4,5,6,9,NA,4,13,31,78,174,240,NA),ncol=2)
And using sapply on the vector c(3,2,2,3,1,3,5,6,4)
sapply(c(3,2,2,3,1,3,5,6,4), function(x) table[which(table[,1] == x),2] )
to give you the desired output too
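The same lookup can be written without any package or sapply by using base R's match(), which returns the position of each input code in the vector of known codes. This is a sketch; the names `from`, `to`, and `res` are introduced here for illustration.

```r
# Known codes and the values they translate to (same pairs as above)
from <- c(0:6, 9)
to   <- c(NA, 4, 13, 31, 78, 174, 240, NA)

x <- c(3, 2, 2, 3, 1, 3, 5, 6, 4)

# match() finds each code's position in `from`; indexing `to` translates it
res <- to[match(x, from)]
res
# [1]  31  13  13  31   4  31 174 240  78
```

Because every element is translated from the original input in one step, the sequential-overwrite problem from the question cannot occur. A code that is not in `from` comes back as NA.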
