Replacing specific elements of a vector - R

I am trying to write the user-defined function below in R:
wrkexpcode.into.month <- function(vec) {
  tmp.vec <- vec
  tmp.vec[tmp.vec == 0 | tmp.vec == 9] <- NA
  tmp.vec[tmp.vec == 1] <- 4
  tmp.vec[tmp.vec == 2] <- 13
  tmp.vec[tmp.vec == 3] <- 31
  tmp.vec[tmp.vec == 4] <- 78
  tmp.vec[tmp.vec == 5] <- 174
  tmp.vec[tmp.vec == 6] <- 240
  return(tmp.vec)
}
but when I execute it with a simple command like
wrkexpcode.into.month(c(3,2,2,3,1,3,5,6,4))
the result comes out as
[1]  31  13  13  31  78  31 174 240  78
but I expect the result to be
[1]  31  13  13  31   4  31 174 240  78
How can I fix this?

You have to carefully follow the flow of your function, evaluating what the values are at each step. You expect 1 to be replaced by 4 because of tmp.vec[tmp.vec == 1] <- 4, but in tmp.vec[tmp.vec == 4] <- 78 further down, that 4 is replaced by 78. This happens because you replace the values in tmp.vec and also use tmp.vec to decide what needs to be replaced. As @MatthewPlourde said, you need to base the replacement on vec:
tmp.vec[vec == 1] <- 4
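For completeness, a minimal sketch of the original function fixed this way, indexing on vec throughout so earlier replacements can never be matched again:
wrkexpcode.into.month <- function(vec) {
  tmp.vec <- vec
  tmp.vec[vec == 0 | vec == 9] <- NA   # codes 0 and 9 become NA
  tmp.vec[vec == 1] <- 4
  tmp.vec[vec == 2] <- 13
  tmp.vec[vec == 3] <- 31
  tmp.vec[vec == 4] <- 78
  tmp.vec[vec == 5] <- 174
  tmp.vec[vec == 6] <- 240
  return(tmp.vec)
}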
Although I would simply replace the code by:
wrkexpcode.into.month <- function(vec) {
  translation_vector <- c('0' = NA, '1' = 4, '2' = 13, '3' = 31,
                          '4' = 78, '5' = 174, '6' = 240, '9' = NA)
  return(translation_vector[as.character(vec)])
}
wrkexpcode.into.month(c(3,2,2,3,1,3,5,6,4))
# 3 2 2 3 1 3 5 6 4
# 31 13 13 31 4 31 174 240 78
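Note that translation_vector[as.character(vec)] returns a named vector, which is why the original codes appear above the values in the output. If plain numbers are preferred, a small tweak (a matter of taste, not part of the original answer) is:
return(unname(translation_vector[as.character(vec)]))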
See also a blogpost I wrote recently about this kind of operation.

I think it will be much easier to use one of the many recode functions designed for such purposes instead of hard-coding it. It's just a one-liner then, e.g.
library(likert)
x <- c(3,2,2,3,1,3,5,6,4)
recode(x, from=c(0:6, 9), to=c(NA, 4,13,31,78,174,240,NA))
[1] 31 13 13 31 4 31 174 240 78
And if desired, wrap it into a function, e.g.
wrkexpcode.into.month <- function(x)
recode(x, from=c(0:6, 9), to=c(NA, 4,13,31,78,174,240,NA))
wrkexpcode.into.month(x)
[1] 31 13 13 31 4 31 174 240 78

You could create a matrix mapping the input value (column 1) to the desired output value (column 2):
table <- matrix(c(0, 1, 2, 3, 4, 5, 6, 9, NA, 4, 13, 31, 78, 174, 240, NA), ncol = 2)
Then use sapply on the vector c(3,2,2,3,1,3,5,6,4):
sapply(c(3,2,2,3,1,3,5,6,4), function(x) table[which(table[,1] == x), 2])
to get the desired output too.
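Run on that example, this should return the same mapping as the other approaches (my expectation from the matrix above, not output quoted from the original answer):
# [1]  31  13  13  31   4  31 174 240  78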

Related

Reducing repetitive tasks in data.table in R

I notice that I am doing the same thing multiple times, just with slightly different values:
HCCtreshold <- 40000
claimsMonthly[, HCC12mnth := +(HCCtreshold < claim12month)][ HCC12mnth == 1, `:=` (aboveHCCth12mnth = (claim12month - HCCtreshold))][is.na(aboveHCCth12mnth),aboveHCCth12mnth := 0]
claimsMonthly[, HCC11mnth := +(HCCtreshold < claim11month)][ HCC11mnth == 1, `:=` (aboveHCCth11mnth = (claim11month - HCCtreshold))][is.na(aboveHCCth11mnth),aboveHCCth11mnth := 0]
claimsMonthly[, HCC10mnth := +(HCCtreshold < claim10month)][ HCC10mnth == 1, `:=` (aboveHCCth10mnth = (claim10month - HCCtreshold))][is.na(aboveHCCth10mnth),aboveHCCth10mnth := 0]
So I started with something like this:
k <- seq.default(from = 8, to = 12, by = 1)
claimsMonthly[paste0("HCC", k, "mnth") := lapply(k, function(x) (+(HCCtreshold < paste0("HCC", k, "mnth"))))]
I get an error:
Error: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
I also tried:
for(k in 8:12){
claimsMonthly[, paste0("HCC", k, "mnth") := +(HCCtreshold < paste0("HCC", k, "mnth"))]
}
The columns are created correctly, but I get incorrect values inside them: I get a 1 everywhere.
I am not sure what I am doing wrong.
I can offer some suggestions and, with some fake data, try them out.
You can programmatically define names on the left-hand side of := if you wrap a vector in c(...), so for instance DT[ c(vec_of_names) := list(some, values)].
You can programmatically retrieve the values of variables with a vector of variable names and mget. While I generally think mget can indicate problematic code, I believe that here it works with low risk. (While mget and get normally retrieve variables from the operating environment, often .GlobalEnv, from within a data.table operation they retrieve columns just as easily.)
Instead of a double-tap of assignment with == 1 and then is.na(...), we can use some logical trickery and the data.table::fcoalesce function. (If you aren't familiar, fcoalesce operates like SQL's coalesce function, which is a vector-friendly way of finding the first non-NA value across its argument vectors.)
fcoalesce(c(1, 2, NA, NA), c(11, 12, 13, NA), c(21, 22, 23, 24))
# [1] 1 2 13 24
We can use fcoalesce(some + math * calc, 0) to do the math and, if NA, replace it with 0. (We use it on the above* variables below, and not necessarily on the HCC* logical variables. It can apply there too, if desired. If those HCC* variables are throw-away, though, it just doesn't matter.)
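Before moving on to the fake data, here is a minimal toy sketch of the first two points (the column and variable names here are invented purely for illustration):
library(data.table)
DT <- data.table(a = 1:3, b = 4:6)
src <- c("a", "b")                  # existing column names
new <- paste0(src, "_doubled")      # new column names, built programmatically
DT[, c(new) := lapply(mget(src), `*`, 2L)]
DT
#    a b a_doubled b_doubled
# 1: 1 4         2         8
# 2: 2 5         4        10
# 3: 3 6         6        12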
Fake data:
library(data.table)
set.seed(42)
hccthreshold <- 50
dat <- data.table( claim10month = sample(99, 10), claim11month = sample(99, 10), claim12month = sample(99, 10) )
dat$claim11month[5] <- NA
dat
# claim10month claim11month claim12month
# 1: 91 46 90
# 2: 92 71 14
# 3: 28 91 96
# 4: 80 25 91
# 5: 61 NA 8
# 6: 49 89 49
# 7: 69 97 37
# 8: 13 11 84
# 9: 60 95 41
# 10: 64 51 76
First, let's programmatically determine the column names we want to act on, and from those create matching vectors of names for the new variables. (I'm a big fan of determining and adapting these variable names programmatically, so that if you get a partial data set your code still works. You might consider adding checks and alarms to catch when something is wrong, for instance stopifnot(length(claimnames) == 12L) if you expect to always have precisely 12 months.)
claimnames <- grep("^claim[0-9]+month", colnames(dat), value = TRUE)
hccnames <- gsub("^claim", "HCC", claimnames)
abovenames <- gsub("^claim", "aboveHCC", claimnames)
claimnames
# [1] "claim10month" "claim11month" "claim12month"
hccnames
# [1] "HCC10month" "HCC11month" "HCC12month"
abovenames
# [1] "aboveHCC10month" "aboveHCC11month" "aboveHCC12month"
And now, we can process the data.
dat[, c(hccnames) := lapply(mget(claimnames), `>`, hccthreshold) ]
dat[, c(abovenames) := Map(function(hcc, clm) fcoalesce(clm - hcc * hccthreshold, 0),
                           mget(hccnames), mget(claimnames)) ]
dat
# claim10month claim11month claim12month HCC10month HCC11month HCC12month aboveHCC10month aboveHCC11month aboveHCC12month
# 1: 91 46 90 TRUE FALSE TRUE 41 46 40
# 2: 92 71 14 TRUE TRUE FALSE 42 21 14
# 3: 28 91 96 FALSE TRUE TRUE 28 41 46
# 4: 80 25 91 TRUE FALSE TRUE 30 25 41
# 5: 61 NA 8 TRUE NA FALSE 11 0 8
# 6: 49 89 49 FALSE TRUE FALSE 49 39 49
# 7: 69 97 37 TRUE TRUE FALSE 19 47 37
# 8: 13 11 84 FALSE FALSE TRUE 13 11 34
# 9: 60 95 41 TRUE TRUE FALSE 10 45 41
# 10: 64 51 76 TRUE TRUE TRUE 14 1 26
I chose to keep the HCC* variables as logical instead of your +(...) integers, but it's directly translatable and up to you.
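If you prefer the 0/1 integer coding from the original +(...) form, one way to convert afterwards (a sketch, assuming the logical columns created above) is:
dat[, c(hccnames) := lapply(mget(hccnames), as.integer)]
NA values (such as the missing claim11month) stay NA either way.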

If() statement in R

I am not very experienced in if statements and loops in R.
Probably you can help me to solve my problem.
My task is to add +1 to values in df$fz while sum(df$fz) < 450, but I have to add +1 only to the top values in df$fz, stopping at the moment sum(df$fz) reaches 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz) < 450) {
  mutate(df, new_value = fz + 1)
}
I know that I can pick top_n(3, z) and add +1 only to that top, but that is not what I want, because then I would have to pick the top manually after checking sum(df$fz).
From what I understood from @Oksana's question and comments, we probably can do it this way:
library(tidyverse)
# data
vru <- data.frame(
  id = c(3, 6, 22, 30, 1234456),
  z  = c(12325, 21698, 21725, 8378, 18979),
  fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>%
  top_n(450 - sum(fz), z) %>%      # subset by top z; if sum(fz) == 450 -> NULL
  mutate(fz = fz + 1) %>%          # increase fz by 1 for the subset
  bind_rows(
    anti_join(vru, ., by = "id"),  # take rows from vru which are not in the subset
    .                              # take the subset with transformed fz
  ) %>%                            # bind those subsets together
  arrange(id)                      # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
  arrange(fz) %>%
  mutate(cumsum_fz = cumsum(fz),
         leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
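If you then want to drop the helper columns, one option (a hedged base R suggestion, not in the original answer) is:
df$cumsum_fz <- NULL
df$leftover <- NULL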
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base R (it was a bit sloppy to mix the two anyway). Same logic as above, using the data the OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134

Storing the output from a loop as a list in R

I am running a small loop to randomly assign a list of numbers (1 to 30) to a subset of 4 groups. I would like to store the outputs of the loop (for 4 subsets) as a single line in one variable and use the results elsewhere. I am also getting some warnings, though the output is correctly displayed on the screen.
list = as.vector(c(6, 9, 3, 12))
start <- 1
end <- 6
i <- 1
while (i <= list) {
  print(sample(start:end, replace = T))
  start <- start + list[i]
  end <- end + list[i + 1]
  i <- i + 1
}
[1] 3 5 6 1 5 6
[1] 9 13 12 7 11 12 14 11 14
[1] 16 17 17
[1] 28 22 26 21 28 26 22 28 26 30 21 19
Error in start:end : NA/NaN argument
In addition: Warning messages:
1: In while (i <= list) { :
the condition has length > 1 and only the first element will be used
2: In while (i <= list) { :
the condition has length > 1 and only the first element will be used
3: In while (i <= list) { :
the condition has length > 1 and only the first element will be used
4: In while (i <= list) { :
the condition has length > 1 and only the first element will be used
5: In while (i <= list) { :
the condition has length > 1 and only the first element will be used
I am unable to find the reasons for this error. Please help. Thanks.
It works fine using a for loop rather than a while loop; there is no need to subset the i variable when we use the seq function:
list = c(6, 9, 3, 12)
start <- 1
end <- 6
for (i in seq(list)) {
  if (i <= list[i]) {
    start <- start + list[i]
    end <- end + (list[i] + 1)
    print(sample(start:end, replace = T))
  }
}
[1] 10 8 11 7 11 10 12
[1] 23 17 18 21 22 18 20 21
[1] 25 21 27 23 26 26 23 25 22
[1] 33 32 37 37 35 40 32 37 34 38
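Since the question also asks about storing the results rather than only printing them, here is one possible sketch (it assumes the same group sizes as the question and splits 1:30 into consecutive ranges, which may or may not be exactly what was intended):
sizes <- c(6, 9, 3, 12)
out <- vector("list", length(sizes))   # pre-allocate a list, one slot per group
start <- 1
for (i in seq_along(sizes)) {
  end <- start + sizes[i] - 1
  out[[i]] <- sample(start:end, size = sizes[i], replace = TRUE)
  start <- end + 1
}
out           # the stored samples, one list element per group
unlist(out)   # or collapsed into a single vector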

Finding local maxima and minima in R

I'm trying to create a function to find a "maxima" and "minima". I have the following data:
y <- c(157, 144,  80, 106, 124,  46, 207, 188, 190, 208, 143, 170,
       162, 178, 155, 163, 162, 149, 135, 160, 149, 147, 133, 146,
       126, 120, 151,  74, 122, 145, 160, 155, 173, 126, 172,  93)
I have tried this function to find "maxima"
localMaxima <- function(x) {
  # Use -Inf instead if x is numeric (non-integer)
  y <- diff(c(-.Machine$integer.max, x)) > 0L
  rle(y)$lengths
  y <- cumsum(rle(y)$lengths)
  y <- y[seq.int(1L, length(y), 2L)]
  if (x[[1]] == x[[2]]) {
    y <- y[-1]
  }
  y
}
maks <- localMaxima(y)
And a function to find the "minima":
localMinima <- function(x) {
  # Use Inf instead if x is numeric (non-integer)
  y <- diff(c(.Machine$integer.max, x)) > 0L
  rle(y)$lengths
  y <- cumsum(rle(y)$lengths)
  y <- y[seq.int(1L, length(y), 2L)]
  if (x[[1]] == x[[2]]) {
    y <- y[-1]
  }
  y
}
mins <- localMinima(y)
And the result is not 100% right
maks = 1 5 7 10 12 14 16 20 24 27 31 33 35
mins = 3 6 8 11 13 15 19 23 26 28 32 34 36
The result should be
maks = 5 7 10 12 14 16 20 24 27 31 33 35
mins = 3 6 8 11 13 15 19 23 26 28 32 34
Finding local maxima and minima in R comes close, but doesn't quite fit.
How can I fix this?
Thank you very much.
You could define two functions like the ones below, which produce the vectors you need:
library(data.table)
#shift lags or leads a vector by a certain amount defined as the second argument
#the default is to lag a vector.
#The rationale behind the below code is that each local minimum's adjacent
#values will be greater than itself. The opposite is true for a local
#maximum. I think this is what you are trying to achieve and one way to do
#it is the following code
maximums <- function(x) which(x - shift(x, 1) > 0 & x - shift(x, 1, type='lead') > 0)
minimums <- function(x) which(x - shift(x, 1) < 0 & x - shift(x, 1, type='lead') < 0)
Output:
> maximums(y)
[1] 5 7 10 12 14 16 20 24 27 31 33 35
> minimums(y)
[1] 3 6 8 11 13 15 19 23 26 28 32 34
This is a function I wrote a while back (and it's more general than you need). It finds peaks in sequential data x, where I define a peak as a local maximum with m points on either side of it having a lower value (so a bigger m leads to more stringent criteria for peak finding):
find_peaks <- function(x, m = 3) {
  shape <- diff(sign(diff(x, na.pad = FALSE)))
  pks <- sapply(which(shape < 0), FUN = function(i) {
    z <- i - m + 1
    z <- ifelse(z > 0, z, 1)
    w <- i + m + 1
    w <- ifelse(w < length(x), w, length(x))
    if (all(x[c(z:i, (i + 2):w)] <= x[i + 1])) return(i + 1) else return(numeric(0))
  })
  pks <- unlist(pks)
  pks
}
So for your case, m = 1:
find_peaks(y, m = 1)
#[1] 5 7 10 12 14 16 20 24 27 31 33 35
and for the minima:
find_peaks(-y, m = 1)
#[1] 3 6 8 11 13 15 19 23 26 28 32 34

Integers that are not divisible by several numbers

I am trying to print a vector with the integers between 1 and 100 that are not divisible by 2, 3 and 7 in R.
I tried seq but I am not sure how to continue.
Another option is to use Filter to, well, filter the sequence for any number that meets your condition:
Filter(function(i) { all(i %% c(2,3,7) != 0) }, seq(100))
## [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
Note that while this may (IMO) be the most readable, it's the worst in terms of performance (so far):
UPDATED to take into account rawr's for loop solution:
library(microbenchmark)
microbenchmark(
  filter = { v1 <- seq(100); Filter(function(i) { all(i %% c(2,3,7) != 0) }, v1) },
  reduce = { v1 <- seq(100); v1[!Reduce(`|`, lapply(c(2,3,7), function(x) !(v1 %% x)))] },
  rowout = { v1 <- seq(100); v1[rowSums(outer(v1, c(2, 3, 7), "%%") == 0) == 0] },
  looopy = { v1 <- seq(100); for (ii in c(2,3,7)) v1 <- v1[-which(v1 %% ii == 0)]; v1 },
  times = 1000
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## filter 108.280 118.7000 143.88592 126.2155 136.6290 2349.952 1000 c
## reduce 21.552 23.8095 25.91997 24.8150 25.8580 144.067 1000 ab
## rowout 26.075 28.4920 31.11812 29.5350 31.2125 184.225 1000 b
## looopy 14.149 16.0765 18.11806 16.8995 17.8595 160.485 1000 a
To make it fair I added sequence generation to all of them (and since I was doing this to compare relative performance rather than absolute speed anyway, the comparison results still hold).
Original statement:
"Unsurprisingly, akrun's is optimal :-)"
is now superseded by:
"Unsurprisingly, rawr's is optimal :-)"
Basically you want to compute each of the numbers in 1:100 modulo 2, 3, and 7. You could use outer to perform all the modulo operations in a single vectorized operation, using rowSums to identify the elements in 1:100 that are not perfectly divided by 2, 3, or 7.
v1 <- 1:100
v1[rowSums(outer(v1, c(2, 3, 7), "%%") == 0) == 0]
# [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
We can do this with lapply and the modulo operator: for each of 2, 3 and 7, !(v1 %% x) is TRUE where v1 is divisible by x; Reduce with | combines these into a single logical vector marking the elements divisible by any of them; negating that and subsetting keeps the 'v1' values we want:
v1[!Reduce(`|`,lapply(c(2,3,7), function(x) !(v1 %%x)))]
Or, instead of looping, this can also be done in a faster way:
v1[!(!v1%%2) + (!v1%%3) + (!v1%%7)]
data
v1 <- seq(100)
The other answers are better, but if you really need to use a for loop, as this question suggests, this could be a possibility:
x <- vector()
n <- 1L
for(i in 1:100){if (i%%2!=0 & i%%3!=0 & i%%7!=0) {x[n] <- i; n <- n+1}}
#> x
# [1] 1 5 11 13 17 19 23 25 29 31 37 41 43 47 53 55 59 61 65 67 71 73 79 83 85 89 95 97
As already mentioned, the other answers posted here are better because they exploit the vectorized capabilities of R. The short code shown here is probably slower than any of the other answers and more complicated to maintain. It is the typical syntax of other programming languages, like C or FORTRAN, applied to R. It works, but it is not the way things should be done.
Rather than using modulo arithmetic explicitly, we can generate the negative modulo sequence for each divisor by counting down with rep_len: position i of rep_len(1:0, 100) is zero exactly when i is divisible by 2, and likewise for 2:0 (divisible by 3) and 6:0 (divisible by 7). Taking pmin of the three sequences leaves a zero wherever a number is divisible by any of them, so converting to logical and dropping it into which() returns the positions divisible by none:
which(as.logical(pmin(rep_len(1:0, 100),
                      rep_len(2:0, 100),
                      rep_len(6:0, 100))))
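As a quick sanity check (my own verification, not part of the original answer), this should agree with the plain modulo filter:
identical(which(as.logical(pmin(rep_len(1:0, 100), rep_len(2:0, 100), rep_len(6:0, 100)))),
          which(1:100 %% 2 != 0 & 1:100 %% 3 != 0 & 1:100 %% 7 != 0))
# [1] TRUE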
If we want to be a bit less hardcoded, we might use do.call with lapply():
which(as.logical(do.call(pmin, lapply(c(2,3,7)-1, function(x)rep_len(x:0, 100)))))
EDIT:
Here's one way to do it using logicals:
v1 <- logical(100); for (ii in c(2,3,7) -1) v1 <- v1 | rep_len(rep(c(F,T), c(ii,1)), 100) ; which(!v1)
I had the same problem in my class. I assumed the teacher gave me all the information I needed to find the answer, and I was correct. This is week one, and all that other silly stuff you other advanced people used has not come up yet.
I did this though.
r = c(1:100)
which(r %% 3 != 0 & r %% 7 != 0 & r %% 2 != 0)
Use the which function.
